1 - MACHINE LEARNING - ML#
Practice 1: Overview of the Sklearn Library#
The Scikit-Learn Library#
The sklearn library is one of the most widely used today, both in academia and in industry. What makes it so popular is that it is written in Python (a language widely used in industry), has an active group of developers, and is open source.
Below we describe the library's main functionality as well as the main algorithms it provides.
Most Commonly Used Class Methods#
fit(X, [y]): Builds the model from the training set. For unsupervised algorithms, y can be omitted.
predict(X): Predicts the class or the regression value.
predict_proba(X): Predicts the class probabilities of the input samples X.
score(X, y[, sample_weight]): Returns the score on the given dataset:
For classification, accuracy
For regression, R²
set_params(**params): Sets the algorithm's hyperparameters (passed as keyword arguments).
get_params([deep]): Returns the algorithm's hyperparameters.
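The example below exercises get_params but not set_params, so here is a minimal sketch (using LogisticRegression, consistent with the rest of this practice) of reading and updating hyperparameters — note that set_params takes keyword arguments, not a dict:

```python
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
params = clf.get_params()            # dict of the current hyperparameters
clf.set_params(C=0.5, max_iter=300)  # update two of them via keyword arguments
assert clf.get_params()["C"] == 0.5
```

set_params returns the estimator itself, so calls can be chained, e.g. `clf.set_params(C=0.5).fit(X, y)`.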
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons, make_blobs, make_regression, make_classification
from sklearn.model_selection import train_test_split
np.set_printoptions(suppress=True)
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
from sklearn.linear_model import LogisticRegression
lrc = LogisticRegression(random_state=42)
model = lrc.fit(X_train, y_train)
preds = model.predict(X_test)
predsp = model.predict_proba(X_test)
scores = model.score(X_test, y_test)
print("Model: ", lrc)
print("Model hps: ", lrc.get_params())
print("Model Score: ", scores)
print("Model prediction: \n", preds[:10])
print("Model prediction (proba): \n", predsp[:10])
Model: LogisticRegression(random_state=42)
Model hps: {'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': 42, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
Model Score: 0.94
Model prediction:
[1 2 2 0 0 0 1 2 2 2]
Model prediction (proba):
[[0.00010604 0.99988101 0.00001294]
[0.00681164 0.00000441 0.99318395]
[0.00124705 0.00063736 0.99811559]
[0.9981574 0.00003683 0.00180576]
[0.99644246 0.00000272 0.00355483]
[0.99873481 0.00000168 0.00126351]
[0.00007422 0.99987744 0.00004834]
[0.00242916 0.00004639 0.99752445]
[0.39413349 0.0837987 0.52206781]
[0.00635346 0.00354697 0.99009957]]
transform: Applies a transformation to the data
fit_transform: Fits and then applies the transformation to the input data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_test_ = scaler.transform(X_test)
X_test_[:10]
array([[ 1.08678365, -1.5164436 ],
[-1.55876784, 0.65593415],
[-1.31288369, -0.37880668],
[ 0.24177302, 1.46105908],
[-0.09670804, 1.81924233],
[ 0.05017578, 1.9255206 ],
[ 0.81429729, -1.62586094],
[-1.46497424, 0.12512164],
[-0.23033232, -0.0316933 ],
[-1.01038457, -0.33584257]])
X_train_ = scaler.fit_transform(X_train)
X_train_[:10]
array([[ 0.29038368, 1.12950646],
[-1.13999693, 0.59901388],
[ 0.01306258, 1.97048737],
[ 1.50049455, -2.55336023],
[ 0.37695627, -1.46288631],
[ 0.49285469, -1.82403643],
[ 0.6848445 , -0.01627551],
[-1.62026329, -0.38417894],
[ 0.97952963, 0.19779602],
[ 1.8370938 , -0.70870969]])
Classification Algorithms#
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=0)
df = pd.DataFrame(
{
'feature-1': X[:, 0],
'feature-2': X[:, 1],
'target': y
}
)
features = ["feature-1", "feature-2"]
target = "target"
sns.scatterplot(data=df, x=features[0], y=features[1], hue=target)
plt.show()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
Logistic Regression (LogisticRegression)#
from sklearn.linear_model import LogisticRegression
lrc = LogisticRegression(random_state=42)
model = lrc.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[1 2 2 0 0 0 1 2 2 2]
0.94
k-Nearest Neighbors (KNeighborsClassifier)#
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
model = knn.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[1 2 2 0 0 0 1 2 0 2]
0.9266666666666666
Decision Tree (DecisionTreeClassifier)#
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(random_state=42)
model = dtc.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[1 2 2 0 0 0 1 2 0 2]
0.9266666666666666
Naive Bayes (GaussianNB)#
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
model = gnb.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[1 2 2 0 0 0 1 2 2 2]
0.9333333333333333
Random Forest (RandomForestClassifier)#
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(random_state=42)
model = rfc.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[1 2 2 0 0 0 1 2 0 2]
0.94
Support Vector Machine (SVC)#
from sklearn.svm import SVC
svc = SVC(random_state=42)
model = svc.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[1 2 2 0 0 0 1 2 2 2]
0.9333333333333333
Multi-layer Perceptron classifier (MLPClassifier)#
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(random_state=42)
model = mlp.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[1 2 2 0 0 0 1 2 0 2]
0.92
Regression Algorithms#
X, y = make_regression(
n_samples=500,
n_features=10,
n_informative=8,
noise=30,
random_state=1
)
df = pd.DataFrame(
{
'feature-1': X[:, 0],
'target': y
}
)
features = ["feature-1"]
target = "target"
sns.scatterplot(data=df, x=features[0], y=target)
plt.show()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
Lasso Regression (Lasso)#
from sklearn.linear_model import Lasso
las = Lasso(random_state=42)
model = las.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[-115.7708748 -46.31684795 -20.86665902 -158.62497355 382.25990628
-124.28940267 68.21063266 -164.95106788 -328.33404498 222.81955835]
0.9469180813122322
Ridge Regression (Ridge)#
from sklearn.linear_model import Ridge
rid = Ridge(random_state=42)
model = rid.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[-122.78992165 -46.12780147 -21.94305706 -162.50410571 387.85141669
-125.93351061 71.01896076 -167.59242227 -334.32584704 225.75003396]
0.9468979075721192
ElasticNet Regression (ElasticNet)#
from sklearn.linear_model import ElasticNet
ent = ElasticNet(random_state=42)
model = ent.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[ -71.19710668 -25.93729254 -15.55881416 -101.25000773 253.68386461
-78.39175189 42.01147833 -104.95807672 -219.87137993 160.11027644]
0.8339397652780292
k-Nearest Neighbors (KNeighborsRegressor)#
from sklearn.neighbors import KNeighborsRegressor
knr = KNeighborsRegressor()
model = knr.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[ -99.6684154 -16.38105864 -50.66401373 -57.41325167 239.25981724
-128.72638105 83.64068093 -146.05781319 -255.81035766 152.25888883]
0.7061793685321537
Decision Tree (DecisionTreeRegressor)#
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor(random_state=42)
model = dtr.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[ 55.72824398 -34.50897194 -28.79172463 -145.85559682 271.30919082
-70.81129949 171.02917036 -145.85559682 -325.1987638 274.33407544]
0.5107145300278733
Random Forest (RandomForestRegressor)#
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(random_state=42)
model = rfr.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[ 24.1702344 -42.91186327 -17.1736669 -139.23683542 218.26798239
-118.29363653 82.02520146 -138.82174793 -263.34837058 231.63570598]
0.8015615355305776
Support Vector Machine (SVR)#
from sklearn.svm import SVR
svr = SVR()
model = svr.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[ 0.44496551 -2.50843853 1.67095038 -3.91897052 16.13692994
-7.41046195 6.61101378 -9.16400223 -11.47253449 19.69234491]
0.10988234690190346
Multi-layer Perceptron (MLPRegressor)#
from sklearn.neural_network import MLPRegressor
mlpr = MLPRegressor(random_state=42)
model = mlpr.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(pred[:10])
print(score)
[-68.09169775 -8.98875781 -19.65923616 -40.95727195 158.65427625
-30.366409 39.17966508 -46.90013887 -99.71012589 106.20396741]
0.5630287207632332
Clustering Algorithms#
X, _ = make_blobs(n_samples=500, n_features=2, centers=3, random_state=1)
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
df = pd.DataFrame(
{
'feature-1': X[:, 0],
'feature-2': X[:, 1],
}
)
features = ["feature-1", "feature-2"]
target = "target"
sns.scatterplot(data=df, x=features[0], y=features[1])
plt.show()
K-Means (KMeans)#
from sklearn.cluster import KMeans
kmeans = KMeans(random_state=42)
y_ = kmeans.fit_predict(X)
print(y_[:10])
[1 4 7 3 2 3 5 2 5 6]
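Note that KMeans defaults to n_clusters=8, which is why eight different labels appear above even though the data were generated with three centers. A short sketch fixing the number of clusters:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# same normalized-blobs setup as above, but with the cluster count made explicit
X, _ = make_blobs(n_samples=500, n_features=2, centers=3, random_state=1)
kmeans3 = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans3.fit_predict(X)
# with three clusters, only the labels {0, 1, 2} can appear
```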
Agglomerative Clustering (AgglomerativeClustering)#
from sklearn.cluster import AgglomerativeClustering
agc = AgglomerativeClustering()
y_ = agc.fit_predict(X)
print(y_[:10])
[0 0 0 1 0 1 0 0 0 0]
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)#
from sklearn.cluster import DBSCAN
dbs = DBSCAN()
y_ = dbs.fit_predict(X)
print(y_[:10])
[0 0 0 0 0 0 0 0 0 0]
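DBSCAN has no n_clusters parameter: the grouping is controlled by eps (the neighborhood radius) and min_samples, and points that belong to no cluster receive the label -1 (noise). A sketch on the same normalized blobs, with eps chosen here only for illustration:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, n_features=2, centers=3, random_state=1)
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # min-max normalization, as above

labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)
n_clusters = len(set(labels) - {-1})  # -1 marks noise points, not a cluster
```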
Preprocessing & Dimensionality Reduction Techniques#
There is a wide range of preprocessing and dimensionality reduction techniques in sklearn; see the scikit-learn documentation for more details on each of them.
Below we highlight the most common ones, each with a usage example in code:
Imputation of Missing Values (SimpleImputer)#
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=0)
X[:25, 0] = np.nan
X[25:50, 1] = np.nan
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
list(map(lambda x: np.isnan(x).sum(), [X_train, X_test, y_train, y_test]))
[32, 18, 0, 0]
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy="mean")
imp_mean.fit(X_train)
X_train_ = imp_mean.transform(X_train)
X_test_ = imp_mean.transform(X_test)
print("Train nan [count]: ", np.isnan(X_train).sum())
print("Test nan [count]: ", np.isnan(X_test).sum())
print("Train transformed nan [count]: ", np.isnan(X_train_).sum())
print("Test transformed nan [count]: ", np.isnan(X_test_).sum())
Train nan [count]:  32
Test nan [count]:  18
Train transformed nan [count]: 0
Test transformed nan [count]: 0
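A quick check (a sketch reusing the same split) that SimpleImputer fills each column with the mean computed on the training set only, which is exposed in the fitted statistics_ attribute:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=0)
X[:25, 0] = np.nan
X[25:50, 1] = np.nan
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imp.fit(X_train)
# the learned statistics are the per-column means, ignoring NaNs
assert np.allclose(imp.statistics_, np.nanmean(X_train, axis=0))
```

Because the statistics come from the training set, the same (training) means are used to fill the test set — exactly what the transform calls above do.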
Scaling Features to a Given Range (MinMaxScaler)#
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
list(map(lambda x: (np.min(x), np.max(x)), [X_train, X_test, y_train, y_test]))
[(-4.10970064619185, 6.560510824746599),
(-3.3511606691597353, 6.213852280547424),
(0, 2),
(0, 2)]
from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler()
mm.fit(X_train)
X_train_ = mm.transform(X_train)
X_test_ = mm.transform(X_test)
print("X_train min/max", (np.min(X_train_), np.max(X_train_)))
print("X_test min/max", (np.min(X_test_), np.max(X_test_)))
X_train min/max (0.0, 1.0)
X_test min/max (0.04484147438192271, 0.960195322826444)
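Because MinMaxScaler learns the minimum and maximum from the training set, test values outside the training range are mapped outside [0, 1] — here the test min/max happen to fall inside. A tiny sketch with a deliberately out-of-range point:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

mm = MinMaxScaler()
mm.fit(np.array([[0.0], [10.0]]))          # training range is [0, 10]
scaled = mm.transform(np.array([[12.0]]))  # 12 lies outside the training range
# (12 - 0) / (10 - 0) = 1.2, i.e. greater than 1
assert np.isclose(scaled[0, 0], 1.2)
```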
Feature Selection (SelectKBest)#
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
list(map(lambda x: x.shape, [X_train, X_test, y_train, y_test]))
[(350, 10), (150, 10), (350,), (150,)]
from sklearn.feature_selection import SelectKBest
skb = SelectKBest(k=5)
skb.fit(X_train, y_train)
X_train_ = skb.transform(X_train)
X_test_ = skb.transform(X_test)
print("X_train shape: ", X_train_.shape)
print("X_test shape: ", X_test_.shape)
X_train shape: (350, 5)
X_test shape: (150, 5)
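SelectKBest defaults to the ANOVA F-test (f_classif) as its scoring function; which of the original columns survived can be inspected with get_support — a sketch reusing the data above:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

skb = SelectKBest(k=5).fit(X_train, y_train)
mask = skb.get_support()  # boolean mask over the 10 original columns
assert mask.sum() == 5    # exactly k columns are kept
```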
Principal Component Analysis (PCA)#
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
list(map(lambda x: x.shape, [X_train, X_test, y_train, y_test]))
[(350, 10), (150, 10), (350,), (150,)]
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
ss.fit(X_train)
X_train_ = ss.transform(X_train)
X_test_ = ss.transform(X_test)
pca = PCA(n_components=5, random_state=0)
pca.fit(X_train_)
X_train_pca = pca.transform(X_train_)
X_test_pca = pca.transform(X_test_)
print("X_train shape: ", X_train_pca.shape)
print("X_test shape: ", X_test_pca.shape)
X_train shape: (350, 5)
X_test shape: (150, 5)
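Unlike SelectKBest, PCA builds 5 new components rather than picking 5 original columns; how much variance they preserve can be read from explained_variance_ratio_ — a sketch on the same standardized data:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

X_train_ = StandardScaler().fit_transform(X_train)
pca = PCA(n_components=5, random_state=0).fit(X_train_)
retained = pca.explained_variance_ratio_.sum()  # fraction of the total variance kept
```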
Creating Pipelines#
Pipelines are especially useful when we want to apply several techniques before modeling. They also help us avoid writing long scripts.
A Pipeline sequentially applies a list of transformations followed by a final estimator. For the intermediate steps, it calls fit and transform automatically.
The following illustrates how a pipeline works.
Without a Pipeline#
We want to perform the following tasks:
Imputation of missing values
Data normalization (z-score)
Dimensionality reduction via PCA to 5 features
Modeling with Logistic Regression
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X[:25, 0] = np.nan
X[25:50, 1] = np.nan
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
imp_mean = SimpleImputer(missing_values=np.nan, strategy="mean")
imp_mean.fit(X_train)
X_train_imp = imp_mean.transform(X_train)
X_test_imp = imp_mean.transform(X_test)
ss = StandardScaler()
ss.fit(X_train_imp)
X_train_ss = ss.transform(X_train_imp)
X_test_ss = ss.transform(X_test_imp)
pca = PCA(n_components=5, random_state=0)
pca.fit(X_train_ss)
X_train_pca = pca.transform(X_train_ss)
X_test_pca = pca.transform(X_test_ss)
lr = LogisticRegression(random_state=42)
model = lr.fit(X_train_pca, y_train)
preds = model.predict(X_test_pca)
predsp = model.predict_proba(X_test_pca)
scores = model.score(X_test_pca, y_test)
print("Model: ", lr)
print("Model hps: ", lr.get_params())
print("Model Score: ", scores)
print("Model prediction: \n", preds[:10])
print("Model prediction (proba): \n", predsp[:10])
Model: LogisticRegression(random_state=42)
Model hps: {'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': 42, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
Model Score: 0.7733333333333333
Model prediction:
[1 1 0 0 1 0 0 1 1 0]
Model prediction (proba):
[[0.27950121 0.72049879]
[0.42193184 0.57806816]
[0.96734767 0.03265233]
[0.58694956 0.41305044]
[0.40789539 0.59210461]
[0.98215815 0.01784185]
[0.51741084 0.48258916]
[0.36402651 0.63597349]
[0.25303639 0.74696361]
[0.87368411 0.12631589]]
With a Pipeline#
We want to perform the same tasks as before:
Imputation of missing values
Data normalization (z-score)
Dimensionality reduction via PCA to 5 features
Modeling with Logistic Regression
from sklearn.pipeline import Pipeline
steps = [
('si', SimpleImputer(missing_values=np.nan, strategy="mean")),
('ss', StandardScaler()),
('pca', PCA(n_components=5, random_state=42)),
('lrc', LogisticRegression(random_state=42))
]
pipe = Pipeline(steps)
pipe.fit(X_train, y_train)
preds = pipe.predict(X_test)
predsp = pipe.predict_proba(X_test)
scores = pipe.score(X_test, y_test)
print("Model: ", pipe)
print("Model Score: ", scores)
print("Model prediction: \n", preds[:10])
print("Model prediction (proba): \n", predsp[:10])
Model: Pipeline(steps=[('si', SimpleImputer()), ('ss', StandardScaler()),
('pca', PCA(n_components=5, random_state=42)),
('lrc', LogisticRegression(random_state=42))])
Model Score: 0.7733333333333333
Model prediction:
[1 1 0 0 1 0 0 1 1 0]
Model prediction (proba):
[[0.27950121 0.72049879]
[0.42193184 0.57806816]
[0.96734767 0.03265233]
[0.58694956 0.41305044]
[0.40789539 0.59210461]
[0.98215815 0.01784185]
[0.51741084 0.48258916]
[0.36402651 0.63597349]
[0.25303639 0.74696361]
[0.87368411 0.12631589]]
Model Selection#
Metrics and Scoring: Quantifying the Quality of Predictions (Evaluation)#
from sklearn.metrics import balanced_accuracy_score
y_true = [0, 1, 0, 0, 1, 0]
y_pred = [0, 1, 0, 0, 0, 1]
balanced_accuracy_score(y_true, y_pred)
0.625
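Balanced accuracy is the unweighted mean of the per-class recalls, which is why the value above is 0.625 — a check by hand:

```python
from sklearn.metrics import balanced_accuracy_score, recall_score

y_true = [0, 1, 0, 0, 1, 0]
y_pred = [0, 1, 0, 0, 0, 1]
recall_0 = recall_score(y_true, y_pred, pos_label=0)  # 3 of 4 zeros recovered: 0.75
recall_1 = recall_score(y_true, y_pred, pos_label=1)  # 1 of 2 ones recovered: 0.50
# (0.75 + 0.50) / 2 = 0.625
assert balanced_accuracy_score(y_true, y_pred) == (recall_0 + recall_1) / 2
```

This makes the metric robust to class imbalance, unlike plain accuracy.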
Cross-Validation: Evaluating Model Performance (CV)#
Train/Test Split#
from sklearn.model_selection import train_test_split
from sklearn import datasets
X, y = datasets.load_iris(return_X_y=True)
[i.shape for i in [X, y]]
[(150, 4), (150,)]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
[i.shape for i in [X_train, X_test, y_train, y_test]]
[(105, 4), (45, 4), (105,), (45,)]
Cross Validation#
from sklearn.model_selection import cross_val_score
clf = DecisionTreeClassifier(random_state=42)
scores = cross_val_score(clf, X_train, y_train, cv=10, scoring="balanced_accuracy")
scores
array([0.91666667, 1. , 1. , 0.75 , 0.83333333,
1. , 1. , 0.80555556, 1. , 0.91666667])
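The ten fold scores are usually summarized by their mean and standard deviation — a sketch reproducing the run above on the iris split:

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(random_state=42)
scores = cross_val_score(clf, X_train, y_train, cv=10, scoring="balanced_accuracy")
# report mean +/- standard deviation across the 10 folds
print(f"balanced accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```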
Tuning the hyper-parameters of an estimator (HPT)#
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_moons
X, y = make_moons()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
tune_parameters = {
"criterion": ["gini", "entropy"],
"max_depth": np.linspace(1, 32, 32, endpoint=True),
"min_samples_leaf": np.linspace(0.01, 0.1, 5, endpoint=True)
}
clf = GridSearchCV(
estimator=DecisionTreeClassifier(),
param_grid=tune_parameters,
scoring="balanced_accuracy",
cv=10,
refit=True
)
clf.fit(X_train, y_train)
GridSearchCV(cv=10, estimator=DecisionTreeClassifier(),
param_grid={'criterion': ['gini', 'entropy'],
'max_depth': array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12., 13.,
14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25., 26.,
27., 28., 29., 30., 31., 32.]),
'min_samples_leaf': array([0.01 , 0.0325, 0.055 , 0.0775, 0.1 ])},
scoring='balanced_accuracy')
clf.score(X_test, y_test)
1.0
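With refit=True, the winning hyperparameter combination and its cross-validated score are exposed on the fitted search object. A self-contained sketch (with a smaller, illustrative grid) of the attributes involved:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid={"criterion": ["gini", "entropy"], "max_depth": [2, 4, 8]},
    scoring="balanced_accuracy",
    cv=5,
    refit=True,
)
clf.fit(X_train, y_train)
print(clf.best_params_)  # winning hyperparameter combination
print(clf.best_score_)   # its mean cross-validated score
```

After fitting, clf itself behaves like the refitted best estimator, which is why `clf.score(X_test, y_test)` above works directly.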
Saving and Loading Models#
A scikit-learn model can be saved using pickle, Python's protocol for serializing and deserializing objects:
Saving a Model#
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
lr = LogisticRegression()
model = lr.fit(X_train, y_train)
import pickle
model_name = "my_model.pkl"
with open(model_name, 'wb') as file:
pickle.dump(model, file)
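For scikit-learn models, joblib (a dependency that scikit-learn already installs) is often more efficient than plain pickle for objects carrying large NumPy arrays — a sketch of the equivalent save/load; the file name my_model.joblib is just an example:

```python
import joblib
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=42)
model = LogisticRegression().fit(X, y)

joblib.dump(model, "my_model.joblib")    # save to disk
loaded = joblib.load("my_model.joblib")  # load it back
# the reloaded model reproduces the original predictions
assert (loaded.predict(X) == model.predict(X)).all()
```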
Loading a Model#
import pickle
model_name = "my_model.pkl"
with open(model_name, 'rb') as file:
pickled_model = pickle.load(file)
preds = pickled_model.predict(X_test)
predsp = pickled_model.predict_proba(X_test)
scores = pickled_model.score(X_test, y_test)
print("Model Score: ", scores)
print("Model prediction: \n", preds[:10])
print("Model prediction (proba): \n", predsp[:10])
Model Score: 1.0
Model prediction:
[0 2 2 2 2 1 1 2 0 1]
Model prediction (proba):
[[0.99985755 0.0000996 0.00004284]
[0.00001745 0.00016558 0.99981698]
[0.00004946 0.00008208 0.99986846]
[0.00004331 0.0001275 0.99982919]
[0.00002165 0.00026253 0.99971582]
[0.00232173 0.99603898 0.00163929]
[0.00024187 0.99971246 0.00004567]
[0.00038217 0.00214574 0.99747209]
[0.99954522 0.00037273 0.00008205]
[0.00190224 0.99772551 0.00037225]]
Practice 2: Machine Learning Tasks#
Imports#
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_blobs, make_regression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import AgglomerativeClustering
from yellowbrick.contrib.classifier import DecisionViz
from yellowbrick.regressor import ResidualsPlot
from yellowbrick.regressor import PredictionError
# plt.rcParams["figure.figsize"] = (10,10)
np.set_printoptions(suppress=True)
ML Tasks#
There are several types of Machine Learning tasks. The figure below shows the most important ones:
Among the most important, we highlight classification, regression, and clustering.
Understanding which task best fits a given problem is one of the Data Scientist's responsibilities.
Supervised Learning#
It is the task of learning a function that maps an input to an output based on example input-output pairs.
Classification#
In classification, the output is discrete, for example:
Whether it will rain or be sunny;
Whether the diagnosis of a disease is positive or negative;
Whether or not a given bank transfer was fraudulent;
Whether a customer will prefer product A, B, or C.
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=0)
df = pd.DataFrame(
{
'feature-1': X[:, 0],
'feature-2': X[:, 1],
'target': y
}
)
features = ["feature-1", "feature-2"]
target = "target"
sns.scatterplot(data=df, x=features[0], y=features[1], hue=target);
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
list(map(lambda x: x.shape, [X_train, X_test, y_train, y_test]))
[(350, 2), (150, 2), (350,), (150,)]
clf = LogisticRegression(random_state=42)
viz = DecisionViz(
clf,
title="Logistic Regression",
features=features
)
viz.fit(X_train, y_train)
viz.draw(X_test, y_test)
viz.show()
<AxesSubplot:xlabel='feature-1', ylabel='feature-2'>
clf = DecisionTreeClassifier(
criterion="entropy",
max_depth=5,
min_samples_split=10,
random_state=42
)
viz = DecisionViz(
clf,
title="Decision Tree",
features=features
)
viz.fit(X_train, y_train)
viz.draw(X_test, y_test)
viz.show()
<AxesSubplot:xlabel='feature-1', ylabel='feature-2'>
clf = KNeighborsClassifier(
n_neighbors=5,
metric="euclidean"
)
viz = DecisionViz(
clf,
title="Nearest Neighbors",
features=features
)
viz.fit(X_train, y_train)
viz.draw(X_test, y_test)
viz.show()
<AxesSubplot:xlabel='feature-1', ylabel='feature-2'>
Regression#
In regression, the output is continuous, for example:
The price of a stock in the following week;
The next day's temperature;
The number of products sold during a campaign;
The price of a raw material over the coming months.
X, y = make_regression(
n_samples=500,
n_features=2,
n_informative=2,
noise=30,
random_state=1
)
df = pd.DataFrame(
{
'feature-1': X[:, 0],
'feature-2': X[:, 1],
'target': y
}
)
features = ["feature-1", "feature-2"]
target = "target"
sns.scatterplot(data=df, x=features[0], y=features[1], hue=target)
sns.relplot(data=df, x=features[0], y=target)
sns.rugplot(data=df, x=features[0], y=target)
sns.relplot(data=df, x=features[1], y=target)
sns.rugplot(data=df, x=features[1], y=target)
<AxesSubplot:xlabel='feature-2', ylabel='target'>
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
list(map(lambda x: x.shape, [X_train, X_test, y_train, y_test]))
[(350, 2), (150, 2), (350,), (150,)]
model = LinearRegression()
visualizer = PredictionError(model)
visualizer.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer.score(X_test, y_test) # Evaluate the model on the test data
visualizer.show() # Finalize and render the figure
<AxesSubplot:title={'center':'Prediction Error for LinearRegression'}, xlabel='$y$', ylabel='$\\hat{y}$'>
Unsupervised Learning#
No labels are known in advance, so the input data are used to discover hidden patterns and group the samples in some way.
Clustering#
Examples:
Customer segmentation
Recommendation engines
Social Network Analysis (SNA)
X, _ = make_blobs(n_samples=500, n_features=4, centers=3, random_state=1)
df = pd.DataFrame(
{
'feature-1': X[:, 0],
'feature-2': X[:, 1],
}
)
features = ["feature-1", "feature-2"]
target = "target"
sns.scatterplot(data=df, x=features[0], y=features[1])
plt.show()
for c in [2, 3, 4]:
cluster = AgglomerativeClustering(n_clusters=c, affinity='euclidean', linkage='ward')
y_ = cluster.fit_predict(X)
sns.scatterplot(data=df, x=features[0], y=features[1], hue = y_)
plt.show()
Practice 3 - Case Study: Cancer Classification from microRNA#
Description & Objective#
Data description: The data were collected from The Cancer Genome Atlas (TCGA), an international, world-reference program that has characterized more than 33 types of cancer. The data are real and were properly anonymized. Each row represents a sample taken from one person. The columns are microRNA types, and each entry represents how strongly that microRNA is expressed. Expression values range over \([0, \infty)\). Values close to zero indicate low expression, while high values indicate high expression. The data also carry labels (see the class attribute): TP (primary solid tumor) indicates tumor and NT indicates normal tissue.
Objective: Build a model to predict whether a person has cancer, given an RNA sequencing exam.
Reading the Data#
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.decomposition import PCA
# plt.rcParams["figure.figsize"] = (15,15)
# sns.set(rc={'figure.figsize':(150, 150)})
df = pd.read_csv("brca_mirnaseq.csv", sep=';', header=0, decimal=',')
df
| hsa.let.7a.1 | hsa.let.7a.2 | hsa.let.7a.3 | hsa.let.7b | hsa.let.7c | hsa.let.7d | hsa.let.7e | hsa.let.7f.1 | hsa.let.7f.2 | hsa.let.7g | ... | hsa.mir.941.1 | hsa.mir.942 | hsa.mir.943 | hsa.mir.944 | hsa.mir.95 | hsa.mir.96 | hsa.mir.98 | hsa.mir.99a | hsa.mir.99b | class | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8962.996542 | 17779.575039 | 9075.200383 | 24749.898857 | 341.298400 | 406.164781 | 1470.179650 | 14.716795 | 3627.642977 | 387.417272 | ... | 0.0 | 5.530515 | 0.187475 | 2.062226 | 4.124452 | 119.984057 | 53.992826 | 130.201449 | 46548.939810 | TP |
| 1 | 7739.739862 | 15524.941906 | 7713.626636 | 23374.640471 | 801.487258 | 513.297924 | 560.962427 | 20.922042 | 6557.093894 | 350.955461 | ... | 0.0 | 8.180047 | 0.000000 | 0.629234 | 1.258469 | 60.249189 | 86.047798 | 236.434808 | 12644.149725 | TP |
| 2 | 8260.612670 | 16497.981335 | 8355.342958 | 10957.355911 | 635.811272 | 620.351816 | 2694.331127 | 39.799878 | 11830.760394 | 600.725980 | ... | 0.0 | 3.618171 | 0.000000 | 0.767491 | 1.644623 | 97.252043 | 117.645369 | 191.434123 | 33083.456616 | TP |
| 3 | 9056.241254 | 18075.168478 | 9097.666150 | 26017.522731 | 2919.348415 | 334.245155 | 1322.434475 | 17.866463 | 6438.725384 | 354.957604 | ... | 0.0 | 3.478426 | 0.000000 | 3.478426 | 1.739213 | 72.572624 | 41.583007 | 1046.690127 | 24067.232290 | TP |
| 4 | 10897.303665 | 21822.338727 | 10963.956320 | 22204.253575 | 3313.009950 | 350.615669 | 1711.886682 | 22.541895 | 8246.117280 | 333.425447 | ... | 0.0 | 2.108235 | 0.000000 | 1.135203 | 0.810860 | 19.947145 | 34.380445 | 1081.037952 | 25715.275426 | TP |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 837 | 10628.975280 | 21125.108661 | 10585.686678 | 23396.813364 | 3892.051211 | 367.141461 | 1484.663795 | 23.402901 | 10570.535667 | 571.680109 | ... | 0.0 | 1.217492 | 0.000000 | 0.405831 | 1.217492 | 79.813361 | 57.627952 | 1100.883277 | 16338.471420 | TP |
| 838 | 16799.785282 | 33603.904432 | 16883.338223 | 20731.006597 | 5263.331356 | 201.676038 | 2173.283559 | 36.888271 | 18227.341203 | 870.301142 | ... | 0.0 | 5.341744 | 0.000000 | 3.124416 | 2.318115 | 16.629958 | 57.348159 | 1919.601107 | 14080.736733 | TP |
| 839 | 13120.807001 | 26337.935723 | 13229.425112 | 18796.895124 | 6581.549565 | 375.598820 | 2547.029500 | 28.505268 | 16838.042944 | 778.398745 | ... | 0.0 | 1.863089 | 0.000000 | 0.558927 | 0.931545 | 41.919511 | 54.215901 | 1310.124456 | 17072.605898 | TP |
| 840 | 7979.531224 | 16006.280243 | 8106.687917 | 20462.010937 | 4040.296936 | 295.594442 | 962.166120 | 23.885025 | 7625.121634 | 428.411748 | ... | 0.0 | 2.070956 | 0.000000 | 2.209020 | 1.656765 | 55.225491 | 53.016472 | 1120.939408 | 18696.866174 | TP |
| 841 | 10439.110392 | 20880.967721 | 10649.126224 | 17770.685685 | 1330.766196 | 790.868182 | 1952.822603 | 29.966587 | 10936.555740 | 577.855691 | ... | 0.0 | 11.487192 | 0.000000 | 3.745823 | 2.746937 | 44.949881 | 80.160621 | 470.225698 | 34080.000799 | TP |
842 rows × 898 columns
df.shape
(842, 898)
Exploratory Data Analysis#
ax = sns.countplot(x="class", data=df)
df["class"].value_counts()
TP 755
NT 87
Name: class, dtype: int64
df["class"].value_counts(normalize=True)
TP 0.896675
NT 0.103325
Name: class, dtype: float64
df.describe()
| hsa.let.7a.1 | hsa.let.7a.2 | hsa.let.7a.3 | hsa.let.7b | hsa.let.7c | hsa.let.7d | hsa.let.7e | hsa.let.7f.1 | hsa.let.7f.2 | hsa.let.7g | ... | hsa.mir.940 | hsa.mir.941.1 | hsa.mir.942 | hsa.mir.943 | hsa.mir.944 | hsa.mir.95 | hsa.mir.96 | hsa.mir.98 | hsa.mir.99a | hsa.mir.99b | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 842.000000 | 842.000000 | 842.000000 | 842.000000 | 842.000000 | 842.000000 | 842.000000 | 842.000000 | 842.000000 | 842.000000 | ... | 842.000000 | 842.000000 | 842.000000 | 842.000000 | 842.000000 | 842.000000 | 842.000000 | 842.000000 | 842.000000 | 842.000000 |
| mean | 9218.938921 | 18432.504585 | 9289.250466 | 26606.604836 | 3152.699471 | 558.321269 | 1289.570177 | 24.359962 | 8687.461926 | 610.223836 | ... | 5.902975 | 0.003737 | 6.446279 | 0.061018 | 2.320737 | 3.150482 | 38.307053 | 63.746405 | 1034.572148 | 44369.112203 |
| std | 4843.796136 | 9704.187427 | 4858.691217 | 16745.347957 | 3238.003201 | 346.883205 | 763.056055 | 12.490091 | 6052.615278 | 317.854963 | ... | 8.325681 | 0.049274 | 9.541682 | 0.172214 | 6.527536 | 4.287594 | 33.791795 | 40.145314 | 1117.491608 | 32754.290751 |
| min | 1294.149164 | 2599.981125 | 1319.952907 | 1817.920354 | 148.795934 | 79.783216 | 161.181457 | 2.439034 | 653.474578 | 88.614573 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.374223 | 18.400719 | 3475.079227 |
| 25% | 5902.143848 | 11741.467528 | 5933.706564 | 14580.357100 | 1276.700850 | 330.638301 | 809.867504 | 16.441786 | 4648.822942 | 410.859815 | ... | 1.378098 | 0.000000 | 2.464140 | 0.000000 | 0.373238 | 1.201951 | 14.906921 | 39.913493 | 387.430475 | 22769.094433 |
| 50% | 8016.628565 | 16040.589880 | 8103.783439 | 23097.825936 | 2352.902327 | 481.342371 | 1101.403395 | 21.890340 | 7019.157941 | 532.277053 | ... | 3.192098 | 0.000000 | 4.127957 | 0.000000 | 1.036215 | 2.235731 | 29.634884 | 52.993693 | 710.026124 | 35594.670263 |
| 75% | 11236.887034 | 22538.594950 | 11289.595988 | 34373.185504 | 3971.192192 | 681.931022 | 1619.864372 | 29.395515 | 10926.448322 | 724.277709 | ... | 7.159431 | 0.000000 | 7.551755 | 0.000000 | 2.345941 | 4.030888 | 51.258145 | 75.993914 | 1242.434228 | 53462.034662 |
| max | 45101.697434 | 90233.655610 | 45095.490102 | 144706.427973 | 59677.212349 | 3370.036117 | 11617.011618 | 121.408006 | 80780.055188 | 3342.745045 | ... | 91.996543 | 0.909391 | 184.185656 | 1.757516 | 122.685820 | 93.402785 | 259.127121 | 399.078716 | 15689.499524 | 248074.178531 |
8 rows × 897 columns
Establishing a Comparative Baseline#
Before any modeling, let's establish a baseline, i.e., a simple solution to the problem. A baseline guides the experiment: it lets us check whether each technique we add is really necessary and actually improves the solution. Without one, the experiment runs blind and we cannot tell whether we are on the right track.
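An even simpler reference point than a default logistic regression (not used in this notebook, shown here as a hedged sketch) is scikit-learn's `DummyClassifier`, which ignores the features entirely; the synthetic data below is a stand-in for the miRNA frame:

```python
from sklearn.dummy import DummyClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced stand-in (~90% / 10%), just to make the sketch runnable
X_demo, y_demo = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

# "most_frequent" always predicts the majority class
dummy = DummyClassifier(strategy="most_frequent")
scores = cross_val_score(dummy, X_demo, y_demo, cv=5, scoring="balanced_accuracy")

# A majority-class predictor scores exactly 0.5 in balanced accuracy,
# so any useful model must beat that.
print(scores.mean())
```

Any pipeline whose cross-validated balanced accuracy is not clearly above this floor adds no real value.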
X = df.drop("class", axis=1)
y = df["class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, stratify=y, random_state=42)
y_train.value_counts(normalize=True)
TP 0.896435
NT 0.103565
Name: class, dtype: float64
y_test.value_counts(normalize=True)
TP 0.897233
NT 0.102767
Name: class, dtype: float64
from sklearn.model_selection import cross_val_score

lrc = LogisticRegression(random_state=42)
cv_list_lr_baseline = cross_val_score(
    lrc,
    X_train,
    y_train,
    cv=10,
    scoring="balanced_accuracy"
)
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
mean_cv_lr_baseline = np.mean(cv_list_lr_baseline)
std_cv_lr_baseline = np.std(cv_list_lr_baseline)
print(f"Performance (bac): {round(mean_cv_lr_baseline, 4)} +- {round(std_cv_lr_baseline, 4)}")
Performance (bac): 0.9201 +- 0.046
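The `ConvergenceWarning` above means lbfgs hit `max_iter` before converging; standardizing the features (or raising `max_iter`) usually resolves it. A minimal sketch on synthetic data, assuming nothing beyond the public sklearn API:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=200, n_features=30, random_state=0)

# Standardizing puts all features on a comparable scale, which helps lbfgs converge;
# max_iter is also raised as an extra safeguard
clf = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(max_iter=1000, random_state=42)),
])
clf.fit(X_demo, y_demo)

# Number of iterations lbfgs actually used
print(clf.named_steps["lr"].n_iter_)
```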
Modeling#
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import KNeighborsClassifier

knn = Pipeline(
    [
        ('mms', MinMaxScaler()),
        ('skb', SelectKBest(chi2, k=10)),
        ('knn', KNeighborsClassifier(
            n_neighbors=3,      # number of neighbors
            p=2,                # Minkowski metric parameter; p=2 is the Euclidean distance
            weights="uniform",  # weight given to each example
        )
        )
    ]
)
cv_list_knn_euclid = cross_val_score(
knn,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_knn_euclid = np.mean(cv_list_knn_euclid)
std_cv_knn_euclid = np.std(cv_list_knn_euclid)
print(f"Performance (bac): {round(mean_cv_knn_euclid, 4)} +- {round(std_cv_knn_euclid, 4)}")
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
Performance (bac): 0.9703 +- 0.0377
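To see which features `SelectKBest` actually keeps, fit the step and inspect `get_support()`. A self-contained sketch on synthetic data (the real notebook would apply this to `X_train`/`y_train`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = make_classification(n_samples=100, n_features=20, random_state=0)

# chi2 requires non-negative features, hence the MinMaxScaler first
X_scaled = MinMaxScaler().fit_transform(X_demo)

skb = SelectKBest(chi2, k=10).fit(X_scaled, y_demo)

# Boolean mask and indices of the 10 selected features
mask = skb.get_support()
print(np.flatnonzero(mask))
```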
knn = Pipeline(
    [
        ('mms', MinMaxScaler()),
        ('skb', SelectKBest(chi2, k=10)),
        ('knn', KNeighborsClassifier(
            n_neighbors=3,      # number of neighbors
            p=1,                # Minkowski metric parameter; p=1 is the Manhattan distance
            weights="uniform",  # weight given to each example
        )
        )
    ]
)
cv_list_knn_man = cross_val_score(
knn,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_knn_man = np.mean(cv_list_knn_man)
std_cv_knn_man = np.std(cv_list_knn_man)
print(f"Performance (bac): {round(mean_cv_knn_man, 4)} +- {round(std_cv_knn_man, 4)}")
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
Performance (bac): 0.9638 +- 0.0407
from sklearn.preprocessing import StandardScaler

lr = Pipeline(
    [
        ('scaler', StandardScaler()),
        ('lr', LogisticRegression(
            penalty="l2",            # penalty term, used to avoid overfitting
            C=1,                     # inverse regularization strength; small values mean stronger regularization
            fit_intercept=True,      # fit the model bias (intercept)
            class_weight="balanced", # class weights, useful for imbalanced datasets
            random_state=42
        )
        )
    ])
cv_list_lr_l2 = cross_val_score(
lr,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_lr_l2 = np.mean(cv_list_lr_l2)
std_cv_lr_l2 = np.std(cv_list_lr_l2)
print(f"Performance (bac): {round(mean_cv_lr_l2, 4)} +- {round(std_cv_lr_l2, 4)}")
Performance (bac): 0.9655 +- 0.0391
lr = Pipeline(
    [
        ('scaler', StandardScaler()),
        ('lr', LogisticRegression(
            penalty="l1",            # penalty term, used to avoid overfitting
            C=1,                     # inverse regularization strength; small values mean stronger regularization
            fit_intercept=True,      # fit the model bias (intercept)
            class_weight="balanced", # class weights, useful for imbalanced datasets
            solver="liblinear",      # lbfgs does not support the l1 penalty
            random_state=42
        )
        )
    ])
cv_list_lr_l1 = cross_val_score(
lr,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_lr_l1 = np.mean(cv_list_lr_l1)
std_cv_lr_l1 = np.std(cv_list_lr_l1)
print(f"Performance (bac): {round(mean_cv_lr_l1, 4)} +- {round(std_cv_lr_l1, 4)}")
Performance (bac): 0.9665 +- 0.0373
from sklearn.decomposition import PCA

lr = Pipeline(
    [
        ('scaler', StandardScaler()),
        ('pca', PCA(n_components=10)),
        ('lr', LogisticRegression(
            penalty="l2",            # penalty term, used to avoid overfitting
            C=1,                     # inverse regularization strength; small values mean stronger regularization
            fit_intercept=True,      # fit the model bias (intercept)
            class_weight="balanced", # class weights, useful for imbalanced datasets
            random_state=42
        )
        )
    ])
cv_list_lr_pca = cross_val_score(
lr,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_lr_pca = np.mean(cv_list_lr_pca)
std_cv_lr_pca = np.std(cv_list_lr_pca)
print(f"Performance (bac): {round(mean_cv_lr_pca, 4)} +- {round(std_cv_lr_pca, 4)}")
Performance (bac): 0.9822 +- 0.0228
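Choosing `n_components=10` is a judgment call; the cumulative explained variance ratio is the usual guide. A hedged sketch on synthetic data (the miRNA frame is not reloaded here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_demo, _ = make_classification(n_samples=200, n_features=50, n_informative=10, random_state=0)

# Fit PCA on standardized data, keeping 10 components
pca = PCA(n_components=10)
pca.fit(StandardScaler().fit_transform(X_demo))

# Fraction of total variance captured by the first 1..10 components
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative[-1])
```

If the last value is far below a target such as 0.9, more components may be worth trying.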
Experimental Evaluation#
# cross-validation results
df_result_cv = pd.DataFrame(
[cv_list_lr_baseline, cv_list_knn_euclid, cv_list_knn_man, cv_list_lr_l2, cv_list_lr_l1, cv_list_lr_pca],
index=["baseline", "kNN-eucli", "kNN-man","LR-L2", "LR-L1", "LR-PCA"]
).T
df_result_cv
| baseline | kNN-eucli | kNN-man | LR-L2 | LR-L1 | LR-PCA | |
|---|---|---|---|---|---|---|
| 0 | 0.907233 | 1.000000 | 0.916667 | 0.990566 | 0.990566 | 0.990566 |
| 1 | 0.990566 | 0.981132 | 0.990566 | 0.888365 | 0.981132 | 0.981132 |
| 2 | 0.971698 | 0.990566 | 0.990566 | 0.990566 | 0.990566 | 0.990566 |
| 3 | 0.907233 | 0.916667 | 0.916667 | 0.916667 | 0.907233 | 0.990566 |
| 4 | 0.907233 | 1.000000 | 1.000000 | 0.990566 | 1.000000 | 1.000000 |
| 5 | 0.916667 | 0.916667 | 0.916667 | 0.916667 | 0.916667 | 0.916667 |
| 6 | 0.907233 | 0.907233 | 0.907233 | 0.981132 | 0.907233 | 0.990566 |
| 7 | 0.878931 | 0.990566 | 1.000000 | 0.990566 | 0.981132 | 0.981132 |
| 8 | 0.980769 | 1.000000 | 1.000000 | 0.990385 | 0.990385 | 0.980769 |
| 9 | 0.833333 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
df_res = df_result_cv.stack().to_frame("balanced_accuracy")
df_res.index.rename(["fold", "pipelines"], inplace=True)
df_res = df_res.reset_index()
df_res.head(12)
| fold | pipelines | balanced_accuracy | |
|---|---|---|---|
| 0 | 0 | baseline | 0.907233 |
| 1 | 0 | kNN-eucli | 1.000000 |
| 2 | 0 | kNN-man | 0.916667 |
| 3 | 0 | LR-L2 | 0.990566 |
| 4 | 0 | LR-L1 | 0.990566 |
| 5 | 0 | LR-PCA | 0.990566 |
| 6 | 1 | baseline | 0.990566 |
| 7 | 1 | kNN-eucli | 0.981132 |
| 8 | 1 | kNN-man | 0.990566 |
| 9 | 1 | LR-L2 | 0.888365 |
| 10 | 1 | LR-L1 | 0.981132 |
| 11 | 1 | LR-PCA | 0.981132 |
plt.figure(figsize=(10,10))
ax = sns.boxplot(x="pipelines", y="balanced_accuracy", data=df_res)
ax = sns.swarmplot(x="pipelines", y="balanced_accuracy", data=df_res, color=".40")
sns.catplot(x="pipelines", y="balanced_accuracy", kind="violin", data=df_res)
<seaborn.axisgrid.FacetGrid at 0x7f218c1d0fd0>
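The visual comparison can be backed with numbers by aggregating the fold scores per pipeline. A sketch assuming a long-format frame shaped like `df_res` (a small hypothetical stand-in is built below):

```python
import pandas as pd

# Minimal stand-in for df_res: fold scores for two hypothetical pipelines
df_res_demo = pd.DataFrame({
    "pipelines": ["baseline"] * 3 + ["LR-PCA"] * 3,
    "balanced_accuracy": [0.90, 0.92, 0.91, 0.98, 0.99, 0.97],
})

# Mean and standard deviation per pipeline, best mean first
summary = (
    df_res_demo.groupby("pipelines")["balanced_accuracy"]
    .agg(["mean", "std"])
    .sort_values("mean", ascending=False)
)
print(summary)
```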
We will select the Logistic Regression with PCA pipeline, since it showed a competitive mean/median and the smallest standard deviation.
Finally, let's evaluate the final performance of our model:
# retrain the selected pipeline on all the training data
lr = Pipeline(
[
('scaler', StandardScaler()),
('pca', PCA(n_components=10)),
('lr', LogisticRegression(
penalty="l2",
C=1,
fit_intercept=True,
class_weight="balanced",
random_state=42
)
)
])
from sklearn.metrics import balanced_accuracy_score

lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
lr_pca_test = balanced_accuracy_score(y_test, y_pred)
print("Performance: ", round(lr_pca_test, 4))
Performance: 0.972
# Confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(lr, X_test, y_test)
plt.show()
ConfusionMatrixDisplay.from_estimator(lr, X_test, y_test, normalize='true')
plt.show()
References & Links#
Prática 4: Drug Classification Using a Decision Tree#
Description & Objective#
Data description: A pharmaceutical company developed and tested two different drugs for treating a disease. The researchers noticed that drug A worked better for some patients while drug B worked better for another group. The following patient characteristics were collected: age (Age), sex (Sex), blood pressure (BP), and cholesterol level (Cholesterol). The team asked you to build an automatic solution to recommend the best drug. However, since this is a medication, the recommendation must be transparent, i.e., the patient needs to understand exactly why it was made.
Objective: Build clear, well-defined rules to recommend the best drug given the patient's characteristics.
1. Reading the Data#
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.tree import plot_tree, DecisionTreeClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import balanced_accuracy_score
# Read the data
data = pd.read_csv("drug_data.csv")
data
| Age | Sex | BP | Cholesterol | Drug | |
|---|---|---|---|---|---|
| 0 | 23 | F | HIGH | HIGH | B |
| 1 | 28 | F | NORMAL | HIGH | A |
| 2 | 61 | F | LOW | HIGH | B |
| 3 | 22 | F | NORMAL | HIGH | A |
| 4 | 49 | F | NORMAL | HIGH | B |
| ... | ... | ... | ... | ... | ... |
| 140 | 72 | M | LOW | HIGH | B |
| 141 | 46 | F | HIGH | HIGH | B |
| 142 | 52 | M | NORMAL | HIGH | A |
| 143 | 23 | M | NORMAL | NORMAL | A |
| 144 | 40 | F | LOW | NORMAL | A |
145 rows × 5 columns
2. Exploratory Data Analysis#
# Count the classes
data["Drug"].value_counts()
B 91
A 54
Name: Drug, dtype: int64
# Percentage of examples in each class
data["Drug"].value_counts(normalize=True)
B 0.627586
A 0.372414
Name: Drug, dtype: float64
# Check for missing values (NaN)
data.isna().sum()
Age 0
Sex 0
BP 0
Cholesterol 0
Drug 0
dtype: int64
# Describe the categorical features
data.describe(include=[object])
| Sex | BP | Cholesterol | Drug | |
|---|---|---|---|---|
| count | 145 | 145 | 145 | 145 |
| unique | 2 | 3 | 2 | 2 |
| top | F | NORMAL | NORMAL | B |
| freq | 74 | 59 | 78 | 91 |
# Describe the numerical features
data.describe(include=[np.number])
| Age | |
|---|---|
| count | 145.000000 |
| mean | 43.848276 |
| std | 16.755319 |
| min | 15.000000 |
| 25% | 30.000000 |
| 50% | 43.000000 |
| 75% | 58.000000 |
| max | 74.000000 |
3. Modeling & Evaluation#
3.1 Baseline#
# Split the data into training and test sets
categorical_features = ["Sex", "BP", "Cholesterol"]
numerical_features = ["Age"]
X = data.drop(columns=["Drug"])
y = data["Drug"].astype('category').cat.codes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
lrc = LogisticRegression(random_state=42)
cv_list_lr_baseline = cross_val_score(
lrc,
X_train[numerical_features],
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_lr_baseline = np.mean(cv_list_lr_baseline)
std_cv_lr_baseline = np.std(cv_list_lr_baseline)
print(f"Performance (bac): {round(mean_cv_lr_baseline, 4)} +- {round(std_cv_lr_baseline, 4)}")
Performance (bac): 0.5 +- 0.0
3.2 Handling Categorical Data#
Although decision tree algorithms were originally designed to handle categorical data, the sklearn implementations only work with numerical data. Therefore, we need to convert the categorical attributes to numerical ones using, for example, OneHotEncoder or OrdinalEncoder.
# This command converts the categories into numbers
oe = OrdinalEncoder()
oe.fit(data[["Sex", "BP"]])
oe.transform(data[["Sex", "BP"]])[:10]
array([[0., 0.],
[0., 2.],
[0., 1.],
[0., 2.],
[0., 2.],
[1., 2.],
[1., 1.],
[0., 0.],
[1., 1.],
[0., 1.]])
# We can transform each feature differently
# For that we can use the ColumnTransformer
tr = ColumnTransformer(
transformers=[
("cat-enc", OrdinalEncoder(), categorical_features),
# ("min-max", MinMaxScaler(), numerical_features)
]
)
tr.fit_transform(X_train)[:10]
array([[0., 2., 1.],
[1., 1., 1.],
[0., 1., 1.],
[0., 0., 1.],
[1., 1., 1.],
[1., 2., 1.],
[0., 1., 1.],
[0., 0., 0.],
[0., 2., 0.],
[0., 0., 0.]])
# We can transform each feature differently
# For that we can use the ColumnTransformer
tr = ColumnTransformer(
transformers=[
("cat-enc", OrdinalEncoder(), categorical_features),
("min-max", MinMaxScaler(), numerical_features)
]
)
tr.fit_transform(X_train)[:10]
array([[0. , 2. , 1. , 0.08474576],
[1. , 1. , 1. , 0.50847458],
[0. , 1. , 1. , 0.61016949],
[0. , 0. , 1. , 0.06779661],
[1. , 1. , 1. , 0.57627119],
[1. , 2. , 1. , 0.13559322],
[0. , 1. , 1. , 0.08474576],
[0. , 0. , 0. , 0.72881356],
[0. , 2. , 0. , 0.42372881],
[0. , 0. , 0. , 0.45762712]])
# We can transform each feature differently
# For that we can use the ColumnTransformer
tr = ColumnTransformer(
    transformers=[
        ("cat-enc", OrdinalEncoder(), categorical_features),
        # ("min-max", MinMaxScaler(), numerical_features)
    ],
    remainder='passthrough'  # pass the remaining features through unchanged
)
tr.fit_transform(X_train)[:10]
array([[ 0., 2., 1., 20.],
[ 1., 1., 1., 45.],
[ 0., 1., 1., 51.],
[ 0., 0., 1., 19.],
[ 1., 1., 1., 49.],
[ 1., 2., 1., 23.],
[ 0., 1., 1., 20.],
[ 0., 0., 0., 58.],
[ 0., 2., 0., 40.],
[ 0., 0., 0., 42.]])
3.3 Modeling a Decision Tree#
For this we can use the DecisionTreeClassifier class, which builds models with the DT algorithm.
Note that the parameters that control the size of the tree must be configured to avoid overfitting. Moreover, very deep trees are hard to interpret. See the note in the algorithm's documentation:
Notes
The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values.
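The effect described in this note can be checked directly by comparing node counts with and without a depth limit; a quick sketch on synthetic data (not the dataset used in this practice):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_syn, y_syn = make_classification(n_samples=500, random_state=0)
# default parameters: fully grown, unpruned tree
full = DecisionTreeClassifier(random_state=0).fit(X_syn, y_syn)
# same data, but the tree size is limited
small = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_syn, y_syn)
# the unconstrained tree grows many more nodes
print(full.tree_.node_count, small.tree_.node_count)
```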
Without limiting the tree size#
categorical_features = ["Sex", "BP", "Cholesterol"]
ct = ColumnTransformer(
transformers=[
("cat", OrdinalEncoder(), categorical_features),
],
remainder='passthrough' # pass the remaining features through unchanged
)
dt = DecisionTreeClassifier(
criterion="gini", # criterion to measure the quality of a split
class_weight="balanced", # give the same weight to both classes
random_state=42
)
pipe1 = Pipeline([
('preprocessing-1', ct),
('model', dt)
])
cv_list_dt = cross_val_score(
pipe1,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_dt = np.mean(cv_list_dt)
std_cv_dt = np.std(cv_list_dt)
print(f"Performance (bac): {round(mean_cv_dt, 4)} +- {round(std_cv_dt, 4)}")
Performance (bac): 0.6643 +- 0.1602
pipe1.fit(X_train, y_train)
model = pipe1["model"]
X_train_prep = pipe1["preprocessing-1"].transform(X_train)
feature_names = pipe1["preprocessing-1"].transformers_[0][2] + X_train.columns[pipe1["preprocessing-1"].transformers_[1][2]].to_list()
class_names = ["A", "B"]
plt.figure(figsize=(15,10))
plot_tree(
model.fit(X_train_prep, y_train),
feature_names=feature_names,
class_names=class_names,
filled=True,
proportion=True
)
plt.show()
Limiting the tree size#
categorical_features = ["Sex", "BP", "Cholesterol"]
ct = ColumnTransformer(
transformers=[
("cat", OrdinalEncoder(), categorical_features),
],
remainder='passthrough' # pass the remaining features through unchanged
)
dt = DecisionTreeClassifier(
criterion="gini", # criterion to measure the quality of a split
max_depth=3, # maximum depth of the tree
min_samples_leaf=0.01, # minimum fraction of samples required at a leaf node - here 1%
class_weight="balanced", # give the same weight to both classes
random_state=42
)
pipe2 = Pipeline([
('preprocessing-1', ct),
('model', dt)
])
cv_list_dt2 = cross_val_score(
pipe2,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_dt2 = np.mean(cv_list_dt2)
std_cv_dt2 = np.std(cv_list_dt2)
print(f"Performance (bac): {round(mean_cv_dt2, 4)} +- {round(std_cv_dt2, 4)}")
Performance (bac): 0.7845 +- 0.1012
pipe2.fit(X_train, y_train)
model = pipe2["model"]
X_train_prep = pipe2["preprocessing-1"].transform(X_train)
feature_names = pipe2["preprocessing-1"].transformers_[0][2] + X_train.columns[pipe2["preprocessing-1"].transformers_[1][2]].to_list()
class_names = ["A", "B"]
plt.figure(figsize=(15,10))
plot_tree(
model.fit(X_train_prep, y_train),
feature_names=feature_names,
class_names=class_names,
filled=True,
proportion=True
)
plt.show()
# retrain the selected pipeline on all the training data
pipe2.fit(X_train, y_train)
y_pred = pipe2.predict(X_test)
dt_test_bac = balanced_accuracy_score(y_test, y_pred)
print("Performance: ", round(dt_test_bac, 4))
Performance: 0.8036
4. DTreeViz (Decision Tree Visualization)#
# run the command below to install dtreeviz (if it is not installed)
!pip install dtreeviz
from sklearn.datasets import load_iris
import dtreeviz
import logging
logging.getLogger('matplotlib.font_manager').disabled = True
classifier = DecisionTreeClassifier(max_depth=3)
iris = load_iris()
classifier.fit(iris.data, iris.target)
viz = dtreeviz.model(classifier,
iris.data,
iris.target,
target_name='variety',
feature_names=iris.feature_names,
class_names=["setosa", "versicolor", "virginica"]
)
viz.view(scale=1.4)
viz.view(x=iris.data[100], scale=1.4)
References & Links#
Practice 5a: Predicting credit product churn with the Support Vector Machine algorithm#
Description & Objective#
Data description: A company offers a well-known credit card product. However, given the intense competition, the company noticed that its customers were starting to abandon the product. It hired you to model which customers are likely to churn, so that it can intervene proactively and avoid losing them. The available data contain several customer characteristics (e.g., sex, age, marital status, education) and credit product usage features (e.g., card limit, card type). The Attrition_Flag column indicates whether the customer abandoned the product.
Objective: Model the problem with an SVM to classify which customers are likely to abandon the product.
1. Loading the Data#
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.tree import plot_tree, DecisionTreeClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import balanced_accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
# Load the data
data = pd.read_csv("credit_card_churn_data.csv")
target = 'Attrition_Flag'
categorical_features = ['Gender', 'Education_Level',
'Marital_Status', 'Income_Category', 'Card_Category'
]
numerical_features = ['Dependent_count', 'Customer_Age', 'Months_on_book',
'Total_Relationship_Count', 'Months_Inactive_12_mon',
'Contacts_Count_12_mon', 'Credit_Limit',
'Total_Revolving_Bal', 'Avg_Open_To_Buy',
'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1',
'Avg_Utilization_Ratio'
]
data
| Dependent_count | Customer_Age | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | Gender | Education_Level | Marital_Status | Income_Category | Card_Category | Attrition_Flag | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 45 | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 | M | High School | Married | $60K - $80K | Blue | Existing Customer |
| 1 | 5 | 49 | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 | F | Graduate | Single | Less than $40K | Blue | Existing Customer |
| 2 | 3 | 51 | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 | M | Graduate | Married | $80K - $120K | Blue | Existing Customer |
| 3 | 4 | 40 | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 | F | High School | Unknown | Less than $40K | Blue | Existing Customer |
| 4 | 3 | 40 | 21 | 5 | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 | M | Uneducated | Married | $60K - $80K | Blue | Existing Customer |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10122 | 2 | 50 | 40 | 3 | 2 | 3 | 4003.0 | 1851 | 2152.0 | 0.703 | 15476 | 117 | 0.857 | 0.462 | M | Graduate | Single | $40K - $60K | Blue | Existing Customer |
| 10123 | 2 | 41 | 25 | 4 | 2 | 3 | 4277.0 | 2186 | 2091.0 | 0.804 | 8764 | 69 | 0.683 | 0.511 | M | Unknown | Divorced | $40K - $60K | Blue | Attrited Customer |
| 10124 | 1 | 44 | 36 | 5 | 3 | 4 | 5409.0 | 0 | 5409.0 | 0.819 | 10291 | 60 | 0.818 | 0.000 | F | High School | Married | Less than $40K | Blue | Attrited Customer |
| 10125 | 2 | 30 | 36 | 4 | 3 | 3 | 5281.0 | 0 | 5281.0 | 0.535 | 8395 | 62 | 0.722 | 0.000 | M | Graduate | Unknown | $40K - $60K | Blue | Attrited Customer |
| 10126 | 2 | 43 | 25 | 6 | 2 | 4 | 10388.0 | 1961 | 8427.0 | 0.703 | 10294 | 61 | 0.649 | 0.189 | F | Graduate | Married | Less than $40K | Silver | Attrited Customer |
10127 rows × 20 columns
2. Exploratory Data Analysis#
# Number of examples and features
data.shape
(10127, 20)
# Count the classes
data[target].value_counts()
Attrition_Flag
Existing Customer 8500
Attrited Customer 1627
Name: count, dtype: int64
# Percentage of examples in each class
data[target].value_counts(normalize=True)
Attrition_Flag
Existing Customer 0.83934
Attrited Customer 0.16066
Name: proportion, dtype: float64
# Check for NaNs
data.isna().sum()
Dependent_count 0
Customer_Age 0
Months_on_book 0
Total_Relationship_Count 0
Months_Inactive_12_mon 0
Contacts_Count_12_mon 0
Credit_Limit 0
Total_Revolving_Bal 0
Avg_Open_To_Buy 0
Total_Amt_Chng_Q4_Q1 0
Total_Trans_Amt 0
Total_Trans_Ct 0
Total_Ct_Chng_Q4_Q1 0
Avg_Utilization_Ratio 0
Gender 0
Education_Level 0
Marital_Status 0
Income_Category 0
Card_Category 0
Attrition_Flag 0
dtype: int64
# Describe the categorical data
data.describe(include=[object])
| Gender | Education_Level | Marital_Status | Income_Category | Card_Category | Attrition_Flag | |
|---|---|---|---|---|---|---|
| count | 10127 | 10127 | 10127 | 10127 | 10127 | 10127 |
| unique | 2 | 7 | 4 | 6 | 4 | 2 |
| top | F | Graduate | Married | Less than $40K | Blue | Existing Customer |
| freq | 5358 | 3128 | 4687 | 3561 | 9436 | 8500 |
# Describe the numerical data
data.describe(include=[np.number])
| Dependent_count | Customer_Age | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 |
| mean | 2.346203 | 46.325960 | 35.928409 | 3.812580 | 2.341167 | 2.455317 | 8631.953698 | 1162.814061 | 7469.139637 | 0.759941 | 4404.086304 | 64.858695 | 0.712222 | 0.274894 |
| std | 1.298908 | 8.016814 | 7.986416 | 1.554408 | 1.010622 | 1.106225 | 9088.776650 | 814.987335 | 9090.685324 | 0.219207 | 3397.129254 | 23.472570 | 0.238086 | 0.275691 |
| min | 0.000000 | 26.000000 | 13.000000 | 1.000000 | 0.000000 | 0.000000 | 1438.300000 | 0.000000 | 3.000000 | 0.000000 | 510.000000 | 10.000000 | 0.000000 | 0.000000 |
| 25% | 1.000000 | 41.000000 | 31.000000 | 3.000000 | 2.000000 | 2.000000 | 2555.000000 | 359.000000 | 1324.500000 | 0.631000 | 2155.500000 | 45.000000 | 0.582000 | 0.023000 |
| 50% | 2.000000 | 46.000000 | 36.000000 | 4.000000 | 2.000000 | 2.000000 | 4549.000000 | 1276.000000 | 3474.000000 | 0.736000 | 3899.000000 | 67.000000 | 0.702000 | 0.176000 |
| 75% | 3.000000 | 52.000000 | 40.000000 | 5.000000 | 3.000000 | 3.000000 | 11067.500000 | 1784.000000 | 9859.000000 | 0.859000 | 4741.000000 | 81.000000 | 0.818000 | 0.503000 |
| max | 5.000000 | 73.000000 | 56.000000 | 6.000000 | 6.000000 | 6.000000 | 34516.000000 | 2517.000000 | 34516.000000 | 3.397000 | 18484.000000 | 139.000000 | 3.714000 | 0.999000 |
3. Modeling & Evaluation#
3.1 Baseline#
# Split the data into training and test sets
X = data.drop(columns=[target])
y = data[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
# Transform categorical features via one-hot encoding
ct = ColumnTransformer(
transformers=[
("ohe", OneHotEncoder(), categorical_features),
],
remainder='passthrough'
)
# z-score to rescale the data
scaler = StandardScaler()
# baseline model
lrc = LogisticRegression(random_state=42)
# machine learning pipeline
pipeb = Pipeline([
('column-transformer', ct),
('scaler', scaler),
('model', lrc)
])
# cross-validate the solution
cv_list_lr_baseline = cross_val_score(
pipeb,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_lr_baseline = np.mean(cv_list_lr_baseline)
std_cv_lr_baseline = np.std(cv_list_lr_baseline)
print(f"Performance (bac): {round(mean_cv_lr_baseline, 4)} +- {round(std_cv_lr_baseline, 4)}")
Performance (bac): 0.7834 +- 0.0227
3.2 Modeling with SVM#
n_samples = 1500
X_, Y_ = datasets.make_circles(n_samples=n_samples, factor=0.5, noise=0.05)
Y_ = ["c1" if i == 0 else "c2" for i in Y_]
aux_data = pd.DataFrame({"x": X_[:, 0], "y": X_[:, 1], "color": Y_})
fig = px.scatter(aux_data, x="x", y="y", color="color", opacity=0.5)
fig.update_layout(autosize=False, width=600, height=600)
fig.show()
Mapping the data with a second-degree polynomial function:
\( k(x, y) = \begin{pmatrix} x^2\\ \sqrt{2}xy\\ y^2\\ \end{pmatrix} \)
def k(x, y):
return x**2, np.sqrt(2)*x*y, y**2
x, y, z = k(X_[:,0], X_[:, 1])
aux_data = pd.DataFrame({"x": x, "y": y, "z": z, "color": Y_})
fig = px.scatter_3d(aux_data, x="x", y="y", z="z", color="color", opacity=0.5)
fig.update_layout(autosize=False, width=600, height=600)
fig.show()
Linear Kernel#
The linear kernel is defined in sklearn as:
\(\left \langle x, x' \right \rangle\)
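This inner product can be checked with sklearn.metrics.pairwise.linear_kernel, which computes exactly the pairwise dot products; a quick sketch on random data:

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(2, 4))

# K[i, j] = <A[i], B[j]>, i.e. a plain matrix of dot products
K = linear_kernel(A, B)
assert np.allclose(K, A @ B.T)
```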
# transform categorical features via one-hot encoding
ct = ColumnTransformer(
transformers=[
("ohe", OneHotEncoder(), categorical_features),
],
remainder='passthrough'
)
# min-max scaling of the data
scaler = MinMaxScaler()
# SVM model
svc = SVC(
C=1.0, # C is the regularization hyperparameter.
# It controls the trade-off between a smooth decision boundary and
# classifying the training points correctly.
# Increasing C may lead to overfitting the training data.
kernel="linear", # kernel type
class_weight="balanced", # give the same weight to both classes
max_iter=100000, # number of optimizer iterations
random_state=42
)
# machine learning pipeline
pipe1 = Pipeline([
('column-transformer', ct),
('scaler', scaler),
('model', svc)
])
# cross-validate the solution
cv_list_pipe1_baseline = cross_val_score(
pipe1,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_pipe1_baseline = np.mean(cv_list_pipe1_baseline)
std_cv_pipe1_baseline = np.std(cv_list_pipe1_baseline)
print(f"Performance (bac): {round(mean_cv_pipe1_baseline, 4)} +- {round(std_cv_pipe1_baseline, 4)}")
Performance (bac): 0.855 +- 0.0119
Polynomial Kernel#
The polynomial kernel is defined in sklearn as:
\(\left ( \gamma \left \langle x, x' \right \rangle + r \right )^d\)
where \(d\) is specified by the degree hyperparameter, \(r\) by coef0, and \(\gamma\) by gamma.
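The formula can be verified numerically against sklearn.metrics.pairwise.polynomial_kernel; a quick sketch on random data:

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 4))
gamma, r, d = 0.5, 1.0, 3

# sklearn's polynomial kernel with explicit gamma, coef0 and degree
K = polynomial_kernel(A, A, degree=d, gamma=gamma, coef0=r)
# matches (gamma * <x, x'> + r)^d computed by hand
assert np.allclose(K, (gamma * (A @ A.T) + r) ** d)
```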
# transform categorical features via one-hot encoding
ct = ColumnTransformer(
transformers=[
("ohe", OneHotEncoder(), categorical_features),
],
remainder='passthrough'
)
# min-max scaling of the data
scaler = MinMaxScaler()
# SVM model
svc = SVC(
C=1.0, # C is the regularization hyperparameter.
# It controls the trade-off between a smooth decision boundary and
# classifying the training points correctly.
# Increasing C may lead to overfitting the training data.
kernel="poly", # kernel type
degree=3, # degree of the polynomial kernel
coef0=1, # independent term of the kernel function
gamma="scale", # gamma coefficient of the kernel function
# scale = 1 / (n_features * X.var())
# auto = 1 / n_features
# a float > 0 can also be used
class_weight="balanced", # give the same weight to both classes
max_iter=100000, # number of optimizer iterations
random_state=42
)
# machine learning pipeline
pipe2 = Pipeline([
('column-transformer', ct),
('scaler', scaler),
('model', svc)
])
# cross-validate the solution
cv_list_pipe2_baseline = cross_val_score(
pipe2,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_pipe2_baseline = np.mean(cv_list_pipe2_baseline)
std_cv_pipe2_baseline = np.std(cv_list_pipe2_baseline)
print(f"Performance (bac): {round(mean_cv_pipe2_baseline, 4)} +- {round(std_cv_pipe2_baseline, 4)}")
Performance (bac): 0.8692 +- 0.0162
RBF Kernel#
The RBF kernel is defined in sklearn as:
\(exp(-\gamma \left \| x - x' \right\|^2)\)
where \(\gamma\) is specified by the gamma hyperparameter and must be \(\geq 0\).
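Likewise, the RBF formula can be checked against sklearn.metrics.pairwise.rbf_kernel; a quick sketch on random data:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 4))
gamma = 0.5

# squared euclidean distances between all pairs of rows
sq_dists = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)
K = rbf_kernel(A, gamma=gamma)
# matches exp(-gamma * ||x - x'||^2) computed by hand
assert np.allclose(K, np.exp(-gamma * sq_dists))
```

Note that the diagonal of an RBF kernel matrix is always 1, since the distance from a point to itself is 0.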
# transform categorical features via one-hot encoding
ct = ColumnTransformer(
transformers=[
("ohe", OneHotEncoder(), categorical_features),
],
remainder='passthrough'
)
# min-max scaling of the data
scaler = MinMaxScaler()
# SVM model
svc = SVC(
C=1.0, # C is the regularization hyperparameter.
# It controls the trade-off between a smooth decision boundary and
# classifying the training points correctly.
# Increasing C may lead to overfitting the training data.
kernel="rbf", # kernel type
gamma="scale", # gamma coefficient of the kernel function
# scale = 1 / (n_features * X.var())
# auto = 1 / n_features
# a float > 0 can also be used
class_weight="balanced", # give the same weight to both classes
max_iter=100000, # number of optimizer iterations
random_state=42
)
# machine learning pipeline
pipe3 = Pipeline([
('column-transformer', ct),
('scaler', scaler),
('model', svc)
])
# cross-validate the solution
cv_list_pipe3_baseline = cross_val_score(
pipe3,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_pipe3_baseline = np.mean(cv_list_pipe3_baseline)
std_cv_pipe3_baseline = np.std(cv_list_pipe3_baseline)
print(f"Performance (bac): {round(mean_cv_pipe3_baseline, 4)} +- {round(std_cv_pipe3_baseline, 4)}")
Performance (bac): 0.857 +- 0.0176
Experimental Evaluation#
# cross-validation results
df_result_cv = pd.DataFrame(
[cv_list_lr_baseline, cv_list_pipe1_baseline, cv_list_pipe2_baseline, cv_list_pipe3_baseline],
index=["baseline", "SVM-linear", "SVM-poly", "SVM-rbf"]
).T
df_result_cv
| baseline | SVM-linear | SVM-poly | SVM-rbf | |
|---|---|---|---|---|
| 0 | 0.761374 | 0.846160 | 0.842349 | 0.824149 |
| 1 | 0.765576 | 0.847759 | 0.867743 | 0.861861 |
| 2 | 0.768097 | 0.881166 | 0.889017 | 0.884159 |
| 3 | 0.777053 | 0.856347 | 0.881925 | 0.866431 |
| 4 | 0.777053 | 0.854666 | 0.861020 | 0.839275 |
| 5 | 0.782279 | 0.855587 | 0.862045 | 0.859052 |
| 6 | 0.836776 | 0.867168 | 0.891722 | 0.862413 |
| 7 | 0.777893 | 0.860077 | 0.864934 | 0.869505 |
| 8 | 0.815479 | 0.839403 | 0.847289 | 0.834308 |
| 9 | 0.772090 | 0.841585 | 0.884272 | 0.868476 |
# flatten the matrix into long format
df_res = df_result_cv.stack().to_frame("balanced_accuracy")
df_res.index.rename(["fold", "pipelines"], inplace=True)
df_res = df_res.reset_index()
df_res.head(12)
| fold | pipelines | balanced_accuracy | |
|---|---|---|---|
| 0 | 0 | baseline | 0.761374 |
| 1 | 0 | SVM-linear | 0.846160 |
| 2 | 0 | SVM-poly | 0.842349 |
| 3 | 0 | SVM-rbf | 0.824149 |
| 4 | 1 | baseline | 0.765576 |
| 5 | 1 | SVM-linear | 0.847759 |
| 6 | 1 | SVM-poly | 0.867743 |
| 7 | 1 | SVM-rbf | 0.861861 |
| 8 | 2 | baseline | 0.768097 |
| 9 | 2 | SVM-linear | 0.881166 |
| 10 | 2 | SVM-poly | 0.889017 |
| 11 | 2 | SVM-rbf | 0.884159 |
plt.figure(figsize=(10,10))
ax = sns.boxplot(x="pipelines", y="balanced_accuracy", data=df_res)
ax = sns.swarmplot(x="pipelines", y="balanced_accuracy", data=df_res, color=".40")
# retrain the selected pipeline on all the training data
pipe2.fit(X_train, y_train)
y_pred = pipe2.predict(X_test)
bac = balanced_accuracy_score(y_test, y_pred)
print("Performance: ", round(bac, 4))
Performance: 0.8633
References & Links#
Practice 5b: Predicting credit product churn with the Multilayer Perceptron algorithm#
Description & Objective#
Data description: A company offers a well-known credit card product. However, given the intense competition, the company noticed that its customers were starting to abandon the product. It hired you to model which customers are likely to churn, so that it can intervene proactively and avoid losing them. The available data contain several customer characteristics (e.g., sex, age, marital status, education) and credit product usage features (e.g., card limit, card type). The Attrition_Flag column indicates whether the customer abandoned the product.
Objective: Model the problem with an MLP so as to return a list of customers ordered from the most to the least likely to abandon the product.
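Such a ranked list can be built from predict_proba; a minimal sketch on synthetic data (not the churn dataset), assuming the positive class plays the role of "churn":

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X_syn, y_syn = make_classification(n_samples=200, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500, random_state=0)
clf.fit(X_syn, y_syn)

# probability of the positive class for each example
churn_proba = clf.predict_proba(X_syn)[:, list(clf.classes_).index(1)]
# indices sorted from the most to the least likely positive
ranking = np.argsort(churn_proba)[::-1]
```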
1. Loading the Data#
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.tree import plot_tree, DecisionTreeClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import balanced_accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
# Load the data
data = pd.read_csv("credit_card_churn_data.csv")
target = 'Attrition_Flag'
categorical_features = ['Gender', 'Education_Level',
'Marital_Status', 'Income_Category', 'Card_Category'
]
numerical_features = ['Dependent_count', 'Customer_Age', 'Months_on_book',
'Total_Relationship_Count', 'Months_Inactive_12_mon',
'Contacts_Count_12_mon', 'Credit_Limit',
'Total_Revolving_Bal', 'Avg_Open_To_Buy',
'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1',
'Avg_Utilization_Ratio'
]
data
| Dependent_count | Customer_Age | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | Gender | Education_Level | Marital_Status | Income_Category | Card_Category | Attrition_Flag | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 45 | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 | M | High School | Married | $60K - $80K | Blue | Existing Customer |
| 1 | 5 | 49 | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 | F | Graduate | Single | Less than $40K | Blue | Existing Customer |
| 2 | 3 | 51 | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 | M | Graduate | Married | $80K - $120K | Blue | Existing Customer |
| 3 | 4 | 40 | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 | F | High School | Unknown | Less than $40K | Blue | Existing Customer |
| 4 | 3 | 40 | 21 | 5 | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 | M | Uneducated | Married | $60K - $80K | Blue | Existing Customer |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10122 | 2 | 50 | 40 | 3 | 2 | 3 | 4003.0 | 1851 | 2152.0 | 0.703 | 15476 | 117 | 0.857 | 0.462 | M | Graduate | Single | $40K - $60K | Blue | Existing Customer |
| 10123 | 2 | 41 | 25 | 4 | 2 | 3 | 4277.0 | 2186 | 2091.0 | 0.804 | 8764 | 69 | 0.683 | 0.511 | M | Unknown | Divorced | $40K - $60K | Blue | Attrited Customer |
| 10124 | 1 | 44 | 36 | 5 | 3 | 4 | 5409.0 | 0 | 5409.0 | 0.819 | 10291 | 60 | 0.818 | 0.000 | F | High School | Married | Less than $40K | Blue | Attrited Customer |
| 10125 | 2 | 30 | 36 | 4 | 3 | 3 | 5281.0 | 0 | 5281.0 | 0.535 | 8395 | 62 | 0.722 | 0.000 | M | Graduate | Unknown | $40K - $60K | Blue | Attrited Customer |
| 10126 | 2 | 43 | 25 | 6 | 2 | 4 | 10388.0 | 1961 | 8427.0 | 0.703 | 10294 | 61 | 0.649 | 0.189 | F | Graduate | Married | Less than $40K | Silver | Attrited Customer |
10127 rows × 20 columns
2. Exploratory Data Analysis#
# Number of examples and features
data.shape
(10127, 20)
# Count the classes
data[target].value_counts()
Existing Customer 8500
Attrited Customer 1627
Name: Attrition_Flag, dtype: int64
# Percentage of examples in each class
data[target].value_counts(normalize=True)
Existing Customer 0.83934
Attrited Customer 0.16066
Name: Attrition_Flag, dtype: float64
# Check for NaNs
data.isna().sum()
Dependent_count 0
Customer_Age 0
Months_on_book 0
Total_Relationship_Count 0
Months_Inactive_12_mon 0
Contacts_Count_12_mon 0
Credit_Limit 0
Total_Revolving_Bal 0
Avg_Open_To_Buy 0
Total_Amt_Chng_Q4_Q1 0
Total_Trans_Amt 0
Total_Trans_Ct 0
Total_Ct_Chng_Q4_Q1 0
Avg_Utilization_Ratio 0
Gender 0
Education_Level 0
Marital_Status 0
Income_Category 0
Card_Category 0
Attrition_Flag 0
dtype: int64
# Describe the categorical data
data.describe(include=[object])
| Gender | Education_Level | Marital_Status | Income_Category | Card_Category | Attrition_Flag | |
|---|---|---|---|---|---|---|
| count | 10127 | 10127 | 10127 | 10127 | 10127 | 10127 |
| unique | 2 | 7 | 4 | 6 | 4 | 2 |
| top | F | Graduate | Married | Less than $40K | Blue | Existing Customer |
| freq | 5358 | 3128 | 4687 | 3561 | 9436 | 8500 |
# Describe the numerical data
data.describe(include=[np.number])
| Dependent_count | Customer_Age | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 |
| mean | 2.346203 | 46.325960 | 35.928409 | 3.812580 | 2.341167 | 2.455317 | 8631.953698 | 1162.814061 | 7469.139637 | 0.759941 | 4404.086304 | 64.858695 | 0.712222 | 0.274894 |
| std | 1.298908 | 8.016814 | 7.986416 | 1.554408 | 1.010622 | 1.106225 | 9088.776650 | 814.987335 | 9090.685324 | 0.219207 | 3397.129254 | 23.472570 | 0.238086 | 0.275691 |
| min | 0.000000 | 26.000000 | 13.000000 | 1.000000 | 0.000000 | 0.000000 | 1438.300000 | 0.000000 | 3.000000 | 0.000000 | 510.000000 | 10.000000 | 0.000000 | 0.000000 |
| 25% | 1.000000 | 41.000000 | 31.000000 | 3.000000 | 2.000000 | 2.000000 | 2555.000000 | 359.000000 | 1324.500000 | 0.631000 | 2155.500000 | 45.000000 | 0.582000 | 0.023000 |
| 50% | 2.000000 | 46.000000 | 36.000000 | 4.000000 | 2.000000 | 2.000000 | 4549.000000 | 1276.000000 | 3474.000000 | 0.736000 | 3899.000000 | 67.000000 | 0.702000 | 0.176000 |
| 75% | 3.000000 | 52.000000 | 40.000000 | 5.000000 | 3.000000 | 3.000000 | 11067.500000 | 1784.000000 | 9859.000000 | 0.859000 | 4741.000000 | 81.000000 | 0.818000 | 0.503000 |
| max | 5.000000 | 73.000000 | 56.000000 | 6.000000 | 6.000000 | 6.000000 | 34516.000000 | 2517.000000 | 34516.000000 | 3.397000 | 18484.000000 | 139.000000 | 3.714000 | 0.999000 |
3. Modeling & Evaluation#
3.1 Baseline#
# Split the data into training and test sets
X = data.drop(columns=[target])
y = data[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
# transform categorical features via one-hot encoding
ct = ColumnTransformer(
transformers=[
("ohe", OneHotEncoder(), categorical_features),
],
remainder='passthrough'
)
# z-score to rescale the data
scaler = StandardScaler()
# baseline model
lrc = LogisticRegression(random_state=42)
# machine learning pipeline
pipeb = Pipeline([
('column-transformer', ct),
('scaler', scaler),
('model', lrc)
])
# cross-validate the solution
cv_list_lr_baseline = cross_val_score(
pipeb,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_lr_baseline = np.mean(cv_list_lr_baseline)
std_cv_lr_baseline = np.std(cv_list_lr_baseline)
print(f"Performance (bac): {round(mean_cv_lr_baseline, 4)} +- {round(std_cv_lr_baseline, 4)}")
Performance (bac): 0.7834 +- 0.0227
3.2 Modeling with MLP#

1 hidden layer -> (10,)#
# transform categorical features via one-hot encoding
ct = ColumnTransformer(
transformers=[
("ohe", OneHotEncoder(), categorical_features),
],
remainder='passthrough'
)
# min-max scaling of the data
scaler = MinMaxScaler()
# MLP model
mlp = MLPClassifier(
hidden_layer_sizes=(10,), # number of neurons
activation="relu", # activation function for the hidden layer
# The options are: identity, logistic, tanh and relu
alpha=0.0001, # strength of the L2 regularization term
solver="adam", # optimizer
batch_size="auto", # mini-batch size for stochastic optimizers
learning_rate_init=0.01, # initial learning rate; controls the step size when updating the weights
learning_rate="constant", # options: constant, invscaling, adaptive
early_stopping=False, # whether to use early stopping to end
# training when the validation score stops improving
max_iter=200, # number of optimizer iterations
random_state=0
)
# pipeline de machine learning
pipe1 = Pipeline([
('column-transformer', ct),
('scaler', scaler),
('model', mlp)
])
# cross validação da solução
cv_list_pipe1_baseline = cross_val_score(
pipe1,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_pipe1_baseline = np.mean(cv_list_pipe1_baseline)
std_cv_pipe1_baseline = np.std(cv_list_pipe1_baseline)
print(f"Performance (bac): {round(mean_cv_pipe1_baseline, 4)} +- {round(std_cv_pipe1_baseline, 4)}")
Performance (bac): 0.8612 +- 0.0277
2 camadas ocultas –> (10, 10)#
# transformar features categóricas via one-hot-encoding
ct = ColumnTransformer(
transformers=[
("ohe", OneHotEncoder(), categorical_features),
],
remainder='passthrough'
)
# minmax para mudar a escala dos dados
scaler = MinMaxScaler()
# modelo MLP
mlp = MLPClassifier(
hidden_layer_sizes=(10, 10), # número de neurônios por camada oculta
activation="relu", # função de ativação das camadas ocultas
# As possibilidades são: identity, logistic, tanh e relu
alpha=0.0001, # força do termo regularizador L2
solver="adam", # otimizador
batch_size="auto", # tamanho do mini-batch para otimizadores estocásticos
learning_rate_init=0.01, # taxa de aprendizado inicial; controla o tamanho do passo na atualização dos pesos
learning_rate="constant", # possibilidades: constant, invscaling, adaptive
early_stopping=False, # se a interrupção antecipada deve encerrar o treinamento
# quando a performance de validação não estiver melhorando
max_iter=200, # número máximo de iterações do otimizador
random_state=0
)
# pipeline de machine learning
pipe2 = Pipeline([
('column-transformer', ct),
('scaler', scaler),
('model', mlp)
])
# cross validação da solução
cv_list_pipe2_baseline = cross_val_score(
pipe2,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_pipe2_baseline = np.mean(cv_list_pipe2_baseline)
std_cv_pipe2_baseline = np.std(cv_list_pipe2_baseline)
print(f"Performance (bac): {round(mean_cv_pipe2_baseline, 4)} +- {round(std_cv_pipe2_baseline, 4)}")
/usr/local/lib/python3.7/dist-packages/sklearn/neural_network/_multilayer_perceptron.py:696: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
ConvergenceWarning,
Performance (bac): 0.8679 +- 0.0367
3 camadas ocultas –> (10, 10, 10)#
# transformar features categóricas via one-hot-encoding
ct = ColumnTransformer(
transformers=[
("ohe", OneHotEncoder(), categorical_features),
],
remainder='passthrough'
)
# minmax para mudar a escala dos dados
scaler = MinMaxScaler()
# modelo MLP
mlp = MLPClassifier(
hidden_layer_sizes=(10, 10, 10), # número de neurônios por camada oculta
activation="relu", # função de ativação das camadas ocultas
# As possibilidades são: identity, logistic, tanh e relu
alpha=0.0001, # força do termo regularizador L2
solver="adam", # otimizador
batch_size="auto", # tamanho do mini-batch para otimizadores estocásticos
learning_rate_init=0.01, # taxa de aprendizado inicial; controla o tamanho do passo na atualização dos pesos
learning_rate="constant", # possibilidades: constant, invscaling, adaptive
early_stopping=False, # se a interrupção antecipada deve encerrar o treinamento
# quando a performance de validação não estiver melhorando
max_iter=200, # número máximo de iterações do otimizador
random_state=0
)
# pipeline de machine learning
pipe3 = Pipeline([
('column-transformer', ct),
('scaler', scaler),
('model', mlp)
])
# cross validação da solução
cv_list_pipe3_baseline = cross_val_score(
pipe3,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_pipe3_baseline = np.mean(cv_list_pipe3_baseline)
std_cv_pipe3_baseline = np.std(cv_list_pipe3_baseline)
print(f"Performance (bac): {round(mean_cv_pipe3_baseline, 4)} +- {round(std_cv_pipe3_baseline, 4)}")
Performance (bac): 0.9082 +- 0.0129
4 camadas ocultas –> (10, 10, 10, 10)#
# transformar features categóricas via one-hot-encoding
ct = ColumnTransformer(
transformers=[
("ohe", OneHotEncoder(), categorical_features),
],
remainder='passthrough'
)
# minmax para mudar a escala dos dados
scaler = MinMaxScaler()
# modelo MLP
mlp = MLPClassifier(
hidden_layer_sizes=(10, 10, 10, 10), # número de neurônios por camada oculta
activation="relu", # função de ativação das camadas ocultas
# As possibilidades são: identity, logistic, tanh e relu
alpha=0.0001, # força do termo regularizador L2
solver="adam", # otimizador
batch_size="auto", # tamanho do mini-batch para otimizadores estocásticos
learning_rate_init=0.01, # taxa de aprendizado inicial; controla o tamanho do passo na atualização dos pesos
learning_rate="constant", # possibilidades: constant, invscaling, adaptive
early_stopping=False, # se a interrupção antecipada deve encerrar o treinamento
# quando a performance de validação não estiver melhorando
max_iter=200, # número máximo de iterações do otimizador
random_state=0
)
# pipeline de machine learning
pipe4 = Pipeline([
('column-transformer', ct),
('scaler', scaler),
('model', mlp)
])
# cross validação da solução
cv_list_pipe4_baseline = cross_val_score(
pipe4,
X_train,
y_train,
cv=10,
scoring="balanced_accuracy"
)
mean_cv_pipe4_baseline = np.mean(cv_list_pipe4_baseline)
std_cv_pipe4_baseline = np.std(cv_list_pipe4_baseline)
print(f"Performance (bac): {round(mean_cv_pipe4_baseline, 4)} +- {round(std_cv_pipe4_baseline, 4)}")
Performance (bac): 0.8684 +- 0.0163
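Os quatro experimentos acima diferem apenas no hiperparâmetro `hidden_layer_sizes`. Um esboço de como varrer as arquiteturas em um único laço, usando dados sintéticos de `make_classification` no lugar do dataset do notebook (nomes e valores aqui são ilustrativos):

```python
# Esboço: varrer as arquiteturas do MLP em um laço (dados sintéticos)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

resultados = {}
for arq in [(10,), (10, 10), (10, 10, 10), (10, 10, 10, 10)]:
    pipe = Pipeline([
        ("scaler", MinMaxScaler()),
        ("model", MLPClassifier(hidden_layer_sizes=arq,
                                learning_rate_init=0.01,
                                max_iter=200,
                                random_state=0)),
    ])
    # validação cruzada com a mesma métrica usada no notebook
    cv = cross_val_score(pipe, X, y, cv=5, scoring="balanced_accuracy")
    resultados[arq] = (cv.mean(), cv.std())
    print(f"MLP{arq}: {cv.mean():.4f} +- {cv.std():.4f}")
```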
Avaliação Experimental#
# resultados da cross-validação
df_result_cv = pd.DataFrame(
[cv_list_lr_baseline, cv_list_pipe1_baseline, cv_list_pipe2_baseline, cv_list_pipe3_baseline, cv_list_pipe4_baseline],
index=["baseline","MLP(10,)", "MLP(10, 10)", "MLP(10, 10, 10)", "MLP(10, 10, 10, 10)"]
).T
df_result_cv
| baseline | MLP(10,) | MLP(10, 10) | MLP(10, 10, 10) | MLP(10, 10, 10, 10) | |
|---|---|---|---|---|---|
| 0 | 0.761374 | 0.822778 | 0.853008 | 0.899676 | 0.872232 |
| 1 | 0.765576 | 0.830341 | 0.863092 | 0.911337 | 0.833886 |
| 2 | 0.768097 | 0.863092 | 0.853111 | 0.880267 | 0.866453 |
| 3 | 0.777053 | 0.908448 | 0.892769 | 0.905558 | 0.862067 |
| 4 | 0.777053 | 0.872520 | 0.879979 | 0.923286 | 0.896970 |
| 5 | 0.782279 | 0.834911 | 0.816342 | 0.901356 | 0.883341 |
| 6 | 0.836776 | 0.887542 | 0.889776 | 0.923470 | 0.868974 |
| 7 | 0.777893 | 0.849093 | 0.800295 | 0.903590 | 0.857865 |
| 8 | 0.815479 | 0.895800 | 0.919059 | 0.908426 | 0.882111 |
| 9 | 0.772090 | 0.847929 | 0.912003 | 0.924831 | 0.860140 |
# linearizar matriz
df_res = df_result_cv.stack().to_frame("balanced_accuracy")
df_res.index.rename(["fold", "pipelines"], inplace=True)
df_res = df_res.reset_index()
df_res.head(12)
| fold | pipelines | balanced_accuracy | |
|---|---|---|---|
| 0 | 0 | baseline | 0.761374 |
| 1 | 0 | MLP(10,) | 0.822778 |
| 2 | 0 | MLP(10, 10) | 0.853008 |
| 3 | 0 | MLP(10, 10, 10) | 0.899676 |
| 4 | 0 | MLP(10, 10, 10, 10) | 0.872232 |
| 5 | 1 | baseline | 0.765576 |
| 6 | 1 | MLP(10,) | 0.830341 |
| 7 | 1 | MLP(10, 10) | 0.863092 |
| 8 | 1 | MLP(10, 10, 10) | 0.911337 |
| 9 | 1 | MLP(10, 10, 10, 10) | 0.833886 |
| 10 | 2 | baseline | 0.768097 |
| 11 | 2 | MLP(10,) | 0.863092 |
plt.figure(figsize=(10,10))
ax = sns.boxplot(x="pipelines", y="balanced_accuracy", data=df_res)
ax = sns.swarmplot(x="pipelines", y="balanced_accuracy", data=df_res, color=".40")
ROC curve#
from sklearn.metrics import DetCurveDisplay, RocCurveDisplay
pipelines = {
"baseline": pipeb,
"MLP(10,)": pipe1,
"MLP(10, 10)": pipe2,
"MLP(10, 10, 10)": pipe3,
"MLP(10, 10, 10, 10)": pipe4,
}
fig, [ax_roc, ax_det] = plt.subplots(1, 2, figsize=(11, 5))
for name, clf in pipelines.items():
    clf.fit(X_train, y_train)
    RocCurveDisplay.from_estimator(clf, X_test, y_test, ax=ax_roc, name=name)
    DetCurveDisplay.from_estimator(clf, X_test, y_test, ax=ax_det, name=name)
ax_roc.set_title("Receiver Operating Characteristic (ROC) curves")
ax_det.set_title("Detection Error Tradeoff (DET) curves")
ax_roc.grid(linestyle="--")
ax_det.grid(linestyle="--")
plt.legend()
plt.show()
Final pipeline#
# retreinar o pipeline selecionado com todos os dados de treinamento
pipe3.fit(X_train, y_train)
y_pred = pipe3.predict(X_test)
bac = balanced_accuracy_score(y_test, y_pred)
print("Performance: ", round(bac, 4))
Performance: 0.8441
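Vale lembrar que a balanced accuracy é a média das revocações (recalls) por classe; um exemplo mínimo com rótulos fictícios:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, recall_score

# rótulos fictícios: classe 0 com recall 1.0 e classe 1 com recall 0.5
y_true = np.array([0, 0, 0, 0, 1, 1])
y_hat = np.array([0, 0, 0, 0, 1, 0])

rec_por_classe = recall_score(y_true, y_hat, average=None)
print(rec_por_classe)                          # [1.  0.5]
print(rec_por_classe.mean())                   # 0.75
print(balanced_accuracy_score(y_true, y_hat))  # 0.75
```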
# Em vez de usar as predições podemos usar a probabilidade
# Desta maneira conseguiremos construir uma lista, ordenando dos mais aos menos prováveis de abandonar o produto
y_pred_prob = pipe3.predict_proba(X_test)
y_pred_prob[:, 0]
array([4.96625438e-05, 1.80525853e-03, 9.37936844e-01, ...,
1.35154815e-02, 7.06941771e-07, 6.21152944e-04])
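Um esboço de como construir essa lista ordenada a partir das probabilidades (valores fictícios no lugar da saída de `pipe3.predict_proba`; qual coluna representa a classe de churn depende de `pipe3.classes_`):

```python
import numpy as np

# probabilidades fictícias da classe de interesse (churn) para 5 clientes
prob_churn = np.array([0.10, 0.93, 0.41, 0.77, 0.05])

# índices dos clientes, do mais ao menos provável de abandonar o produto
ranking = np.argsort(prob_churn)[::-1]
print(ranking)              # [1 3 2 0 4]
print(prob_churn[ranking])  # [0.93 0.77 0.41 0.1  0.05]
```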
Referências & Links#
Prática 6: Usando Algoritmos de Ensemble em um Problema de Regressão#
Descrição & Objetivo#
Descrição dos Dados: Consumidores fanáticos por abacate decidiram criar uma planilha com o preço médio anual em diversas regiões durante os anos de 2015 a 2018. Os seguintes campos foram anotados:
AveragePrice - preço médio de um abacate
type - tipo do abacate: conventional ou organic
year - ano
Region - cidade ou região onde a observação foi feita
Total Volume - Número médio de abacates vendidos diariamente
4046 - Número médio diário de abacates vendidos com PLU 4046
4225 - Número médio diário de abacates vendidos com PLU 4225
4770 - Número médio diário de abacates vendidos com PLU 4770
Para entender mais sobre o PLU veja esse site.
Objetivo: Modelar o problema com Ensembles de forma a prever o preço médio de um abacate.
1. Imports#
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.tree import plot_tree, DecisionTreeClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import balanced_accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
2. Leitura dos Dados#
# Leitura de dados
data = pd.read_csv("avocado_grouped_data.csv")
target = 'AveragePrice'
categorical_features = ["region", "type"]
numerical_features = ["year", "Total Volume", "4046", "4225", "4770"]
data
| year | type | region | AveragePrice | Total Volume | 4046 | 4225 | 4770 | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2015 | conventional | Albany | 1.171923 | 7.620873e+04 | 1037.874615 | 61764.253654 | 668.795000 |
| 1 | 2015 | conventional | Atlanta | 1.052308 | 4.403464e+05 | 347741.840385 | 35386.637308 | 757.858077 |
| 2 | 2015 | conventional | BaltimoreWashington | 1.168077 | 7.681415e+05 | 56546.030769 | 487421.365385 | 45104.819423 |
| 3 | 2015 | conventional | Boise | 1.054038 | 7.088575e+04 | 45940.442500 | 10164.187115 | 5309.087692 |
| 4 | 2015 | conventional | Boston | 1.144038 | 5.237806e+05 | 4685.945192 | 409901.282692 | 1607.198846 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 427 | 2018 | organic | Syracuse | 1.242500 | 5.688144e+03 | 143.621667 | 101.041667 | 0.000000 |
| 428 | 2018 | organic | Tampa | 1.452500 | 8.415949e+03 | 125.270833 | 610.078333 | 0.000000 |
| 429 | 2018 | organic | TotalUS | 1.554167 | 1.510488e+06 | 137640.345000 | 360420.295000 | 1407.105833 |
| 430 | 2018 | organic | West | 1.613333 | 2.549791e+05 | 28676.619167 | 53221.085833 | 191.065833 |
| 431 | 2018 | organic | WestTexNewMexico | 1.645000 | 1.676254e+04 | 1918.260833 | 2275.085000 | 140.700833 |
432 rows × 8 columns
3. Análise Exploratória de Dados#
# Número de exemplos e features
data.shape
(432, 8)
# target distribution
data[target].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7f9d2336d4d0>
# estatisticas da target
data[target].describe()
count 432.000000
mean 1.394253
std 0.320811
min 0.659808
25% 1.145817
50% 1.385064
75% 1.630096
max 2.403208
Name: AveragePrice, dtype: float64
# Verificar se tem NaN
data.isna().sum()
year 0
type 0
region 0
AveragePrice 0
Total Volume 0
4046 0
4225 0
4770 0
dtype: int64
# Descrever dados categóricos
data.describe(include=[object])
| type | region | |
|---|---|---|
| count | 432 | 432 |
| unique | 2 | 54 |
| top | conventional | Albany |
| freq | 216 | 8 |
# Descrever dados numéricos
data.describe(include=[np.number])
| year | AveragePrice | Total Volume | 4046 | 4225 | 4770 | |
|---|---|---|---|---|---|---|
| count | 432.00000 | 432.000000 | 4.320000e+02 | 4.320000e+02 | 4.320000e+02 | 4.320000e+02 |
| mean | 2016.50000 | 1.394253 | 8.920706e+05 | 3.049745e+05 | 2.989822e+05 | 2.188050e+04 |
| std | 1.11933 | 0.320811 | 3.582444e+06 | 1.295333e+06 | 1.194733e+06 | 9.576609e+04 |
| min | 2015.00000 | 0.659808 | 1.289274e+03 | 2.679057e+00 | 6.390962e+00 | 0.000000e+00 |
| 25% | 2015.75000 | 1.145817 | 1.230606e+04 | 8.947879e+02 | 3.480897e+03 | 9.158173e-01 |
| 50% | 2016.50000 | 1.385064 | 1.180031e+05 | 1.063797e+04 | 3.042860e+04 | 3.231658e+02 |
| 75% | 2017.25000 | 1.630096 | 4.774981e+05 | 1.167135e+05 | 1.612054e+05 | 7.488512e+03 |
| max | 2018.00000 | 2.403208 | 4.212553e+07 | 1.467107e+07 | 1.243870e+07 | 1.153793e+06 |
4. Modelagem & Avaliação com Ensembles#
3.1 Baseline#
# Separar os dados em treinamento e teste
X = data.drop(columns=[target])
y = data[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# transformar features categóricas via ordinal encoding
ct = ColumnTransformer(
transformers=[
("cat", OrdinalEncoder(), categorical_features),
],
remainder='passthrough'
)
# z-score para mudar a escala dos dados
scaler = StandardScaler()
# modelo baseline
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
# pipeline de machine learning
pipeb = Pipeline([
('column-transformer', ct),
('scaler', scaler),
('model', lr)
])
# cross validação da solução
cv_list_lr_baseline = cross_val_score(
pipeb,
X_train,
y_train,
cv=10,
scoring="r2"
)
mean_cv_lr_baseline = np.mean(cv_list_lr_baseline)
std_cv_lr_baseline = np.std(cv_list_lr_baseline)
print(f"Performance (r2): {round(mean_cv_lr_baseline, 4)} +- {round(std_cv_lr_baseline, 4)}")
Performance (r2): 0.5537 +- 0.1066
3.2 Modelando com Random Forest e Gradient Boosting#
# transformar features categóricas via one-hot-encoding
ct = ColumnTransformer(
transformers=[
("ohe", OneHotEncoder(), categorical_features),
],
remainder='passthrough'
)
# modelo RandomForestRegressor
rf = RandomForestRegressor(
n_estimators = 100,
criterion="squared_error",
random_state=0
)
# pipeline de machine learning
pipe1 = Pipeline([
('column-transformer', ct),
('model', rf)
])
# cross validação da solução
cv_list_pipe1_baseline = cross_val_score(
pipe1,
X_train,
y_train,
cv=10,
scoring="r2"
)
mean_cv_pipe1_baseline = np.mean(cv_list_pipe1_baseline)
std_cv_pipe1_baseline = np.std(cv_list_pipe1_baseline)
print(f"Performance (R2): {round(mean_cv_pipe1_baseline, 4)} +- {round(std_cv_pipe1_baseline, 4)}")
Performance (R2): 0.7605 +- 0.0821
# transformar features categóricas via one-hot-encoding
ct = ColumnTransformer(
transformers=[
("ohe", OneHotEncoder(), categorical_features),
],
remainder='passthrough'
)
# modelo GradientBoostingRegressor
gb = GradientBoostingRegressor(
n_estimators = 100,
loss="squared_error",
learning_rate=0.1,
random_state=0
)
# pipeline de machine learning
pipe2 = Pipeline([
('column-transformer', ct),
('model', gb)
])
# cross validação da solução
cv_list_pipe2_baseline = cross_val_score(
pipe2,
X_train,
y_train,
cv=10,
scoring="r2"
)
mean_cv_pipe2_baseline = np.mean(cv_list_pipe2_baseline)
std_cv_pipe2_baseline = np.std(cv_list_pipe2_baseline)
print(f"Performance (R2): {round(mean_cv_pipe2_baseline, 4)} +- {round(std_cv_pipe2_baseline, 4)}")
Performance (R2): 0.7936 +- 0.0495
Avaliação Experimental#
# resultados da cross-validação
df_result_cv = pd.DataFrame(
[cv_list_lr_baseline, cv_list_pipe1_baseline, cv_list_pipe2_baseline],
index=["baseline","RandomForestRegressor", "GradientBoostingRegressor"]
).T
df_result_cv
| baseline | RandomForestRegressor | GradientBoostingRegressor | |
|---|---|---|---|
| 0 | 0.538925 | 0.770106 | 0.792857 |
| 1 | 0.463798 | 0.557949 | 0.676694 |
| 2 | 0.630720 | 0.783451 | 0.820644 |
| 3 | 0.639834 | 0.730605 | 0.807127 |
| 4 | 0.402981 | 0.711545 | 0.817230 |
| 5 | 0.444394 | 0.778620 | 0.759449 |
| 6 | 0.543634 | 0.806082 | 0.770342 |
| 7 | 0.456912 | 0.746108 | 0.793083 |
| 8 | 0.720543 | 0.859275 | 0.822751 |
| 9 | 0.695542 | 0.860805 | 0.876062 |
# linearizar matriz
df_res = df_result_cv.stack().to_frame("r2")
df_res.index.rename(["fold", "pipelines"], inplace=True)
df_res = df_res.reset_index()
df_res.head(12)
| fold | pipelines | r2 | |
|---|---|---|---|
| 0 | 0 | baseline | 0.538925 |
| 1 | 0 | RandomForestRegressor | 0.770106 |
| 2 | 0 | GradientBoostingRegressor | 0.792857 |
| 3 | 1 | baseline | 0.463798 |
| 4 | 1 | RandomForestRegressor | 0.557949 |
| 5 | 1 | GradientBoostingRegressor | 0.676694 |
| 6 | 2 | baseline | 0.630720 |
| 7 | 2 | RandomForestRegressor | 0.783451 |
| 8 | 2 | GradientBoostingRegressor | 0.820644 |
| 9 | 3 | baseline | 0.639834 |
| 10 | 3 | RandomForestRegressor | 0.730605 |
| 11 | 3 | GradientBoostingRegressor | 0.807127 |
plt.figure(figsize=(10,10))
ax = sns.boxplot(x="pipelines", y="r2", data=df_res)
ax = sns.swarmplot(x="pipelines", y="r2", data=df_res, color=".40")
Final pipeline#
# retreinar o pipeline selecionado com todos os dados de treinamento
from sklearn.metrics import r2_score
pipe2.fit(X_train, y_train)
y_pred = pipe2.predict(X_test)
r2 = r2_score(y_test, y_pred)
print("R2: ", round(r2, 4))
R2: 0.8158
Visualização do Erro#
from yellowbrick.regressor import prediction_error
visualizer = prediction_error(pipe2, X_train, y_train, X_test, y_test)
from yellowbrick.regressor import residuals_plot
viz = residuals_plot(pipe2, X_train, y_train, X_test, y_test)
Referências & Links#
2 - ESTATÍSTICA - ECD#
Revisão de probabilidade e estatística clássica#
1 - Probabilidades#
Exemplo: Lançando dois dados equilibrados, qual é a probabilidade de que:
a) A soma das faces seja igual a 7 (evento A).
b) Obter uma soma maior do que 5 (evento B).
Na aula, vimos que:
\(P(A) = 6/36 = 1/6 \approx 0,167\). Para o evento B, 26 dos 36 pares possíveis têm soma maior do que 5, logo \(P(B) = 26/36 \approx 0,722\).
import random
import numpy as np
n = 1000 #numero de experimentos
nA = 0
nB = 0
nf = 6 # número de faces
faces = np.arange(1,nf+1) #valores 1 a nf
for i in range(0,n):
    dado1 = random.choice(faces)
    dado2 = random.choice(faces)
    if (dado1+dado2 == 7):
        nA = nA + 1
    if((dado1 + dado2) > 5):
        nB = nB + 1
nA = nA/n
nB = nB/n
print('Probabilidade de que a soma das faces seja igual a 7:', nA)
print('Probabilidade de que a soma das faces seja maior do que 5:', nB)
Probabilidade de que a soma das faces seja igual a 7: 0.15
Probabilidade de que a soma das faces seja maior do que 5: 0.716
faces
array([1, 2, 3, 4, 5, 6])
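A mesma simulação pode ser vetorizada com numpy, dispensando o laço explícito:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # número de experimentos

# soma das faces de dois dados em n lançamentos
soma = rng.integers(1, 7, n) + rng.integers(1, 7, n)
print('P(soma = 7) =', (soma == 7).mean())  # teórico: 6/36 ≈ 0.167
print('P(soma > 5) =', (soma > 5).mean())   # teórico: 26/36 ≈ 0.722
```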
Exemplo: Qual é a probabilidade de que em um lançamento de dados saia um número par ou maior do que três?
Solução: os resultados favoráveis são {2, 4, 5, 6}, logo \(P = 4/6 \approx 0,667\).
import random
import numpy as np
n = 1000 #numero de experimentos
nA = 0
faces = np.arange(1,7) #valores 1 a 6
for i in range(0,n):
    dado = random.choice(faces)
    if (dado%2 == 0) or (dado > 3):
        nA = nA + 1
nA = nA/n
print('Probabilidade de que saia um número par ou maior do que três:', nA)
Probabilidade de que saia um número par ou maior do que três: 0.664
2 - Modelos Probabilísticos#
Exemplo: Seja a variável aleatória X com a distribuição abaixo. Calcule E[X] e V(X).
\[
P(X=0) = 0.2,\quad P(X=1) = 0.2, \quad P(X=2) = 0.6
\]
O valor esperado:
\[
E[X] = 0 \cdot 0.2 + 1 \cdot 0.2 + 2 \cdot 0.6 = 1.4
\]
A variância:
\[
V(X) = E[X^2] - E[X]^2 = 2.6 - 1.4^2 = 0.64
\]
import random
import numpy as np
n = 1000 #numero de experimentos
nA = 0
nB = 0
X = [0,0,1,1,2,2,2,2,2]
x_obs = []
for i in range(0,n):
    x_obs.append(random.choice(X))
print('Valor esperado de X:', np.mean(x_obs))
print('Variância de X:', np.std(x_obs)**2)
Valor esperado de X: 1.346
Variância de X: 0.6602840000000001
Distribuição binomial:#
from scipy.stats import binom
from matplotlib import pyplot as plt
import numpy as np
import math as math
np.random.seed(100)
n = 100 # numero de lançamentos
p = 0.3 # probabilidade de sair cara
ns = 1000 # numero de simulacoes
X = np.random.binomial(n, p, ns) # funcao para gerar valores de uma binomial
plt.figure(figsize=(10,6))
Pk, bins, ignored = plt.hist(X, bins='auto', density=True, color='#0504aa',alpha=0.7,
rwidth=0.9)
# curva teórica
Pkt = np.zeros(n+1) # valores teóricos da probabilidade
vkt = np.arange(0,n+1) # variação em k
for k in range(0,n+1): # varia de 0 até n
    C = (math.factorial(n)/(math.factorial(n-k)*math.factorial(k)))
    Pkt[k] = C*(p**k)*(1-p)**(n-k)
plt.plot(vkt, Pkt, 'r--', label='Prob. Teórica')
plt.xlabel('k', fontsize = 20)
plt.ylabel('P(k)',fontsize = 20)
plt.legend(fontsize = 15)
plt.xlim(10,50)
plt.show()
Exemplo: Em uma urna há 8 bolas brancas e 4 pretas. Retiram-se 5 bolas com reposição. Calcule a probabilidade de que:
a) saiam duas bolas brancas.
import math
def binomial(n,p,k):
    C = (math.factorial(n)/(math.factorial(n-k)*math.factorial(k)))
    pk = C*(p**k)*(1-p)**(n-k)
    return pk
n = 5
p = 8/12
k = 2
print('Probabilidade:', binomial(n,p,k))
Probabilidade: 0.16460905349794244
from scipy.stats import binom
#binom.pmf(k) = choose(n, k) * p**k * (1-p)**(n-k)
ns = 1000 #numero de experimentos
X = ['B','B','B','B','B','B','B','B','P','P','P','P']
n = 5 # numero de bolas retiradas
k = 0
for i in range(0,ns):
    saida = []
    for j in range(0,n):
        bola = random.choice(X)
        saida.append(bola)
    nbrancas = 0
    for s in saida:
        if(s == 'B'):
            nbrancas = nbrancas + 1
    if(nbrancas == 2):
        k = k + 1 # saíram exatamente duas brancas: mais um sucesso
print('Valor teórico:', binom.pmf(2, 5, 8/12))
print('Valor obtido = ', k/ns)
Valor teórico: 0.1646090534979425
Valor obtido = 0.166
Modelo de Poisson:#
Exemplo: Em uma central telefônica, chegam 300 mensagens por hora. Qual é a probabilidade de que em um minuto não ocorra nenhuma chamada?
import numpy as np
import math
def Poisson(lbd, k):
    pk = np.exp(-lbd)*(lbd**k)/math.factorial(k)
    return pk
lbd = 5 #numero de chamadas por minuto
k = 0
print("P(k = 0) = ",Poisson(lbd,k))
P(k = 0) = 0.006737946999085467
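O valor pode ser conferido com a pmf de `scipy.stats.poisson`:

```python
import numpy as np
from scipy.stats import poisson

lbd = 5  # 300 chamadas/hora equivalem a 5 chamadas/minuto
p0 = poisson.pmf(0, lbd)  # P(k = 0) = e^{-lambda}
print(p0)
```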
Modelo exponencial#
from scipy.stats import expon
import matplotlib.pyplot as plt
import numpy as np
alpha = 2
X = expon.rvs(scale=1/alpha,size=1000)
plt.figure(figsize=(8,5))
P, bins, ignored = plt.hist(X, bins='auto', density=True, color='#0504aa',alpha=0.7,
rwidth=0.9)
plt.xlabel('k', fontsize = 15)
plt.ylabel('P(k)',fontsize = 15)
plt.show()
print('Esperanca teorica:', 1/alpha, 'Média amostral:', np.mean(X))
print('Variância teórica:', 1/alpha**2,'Variância amostral:', np.var(X))
Esperanca teorica: 0.5 Média amostral: 0.4636687124091369
Variância teórica: 0.25 Variância amostral: 0.21690643659900463
Modelo Normal#
import numpy as np
import matplotlib.pyplot as plt
import math as math
# funcao que retorna a distribuicao teorica
def normal_dist(x, mean, sigma):
    prob_density = (1/(sigma*(math.sqrt(2*np.pi))))*np.exp(-0.5*((x-mean)/sigma)**2)
    return prob_density
# gera n amostras de uma normal e mostra o histograma
mean = 75
sigma = 20
n = 10000
X = np.random.normal(mean, sigma, n)
plt.figure(figsize=(10,6))
Pk, bins, ignored = plt.hist(X, bins='auto', density=True, color='#0504aa',alpha=0.7,
rwidth=0.9)
# define os valores de x
x = np.linspace(np.min(X),np.max(X),200)
# Distribuicao teorica
pdf = normal_dist(x,mean,sigma)
# Mostra os resultados
plt.plot(x,pdf , color = 'red')
plt.xlabel('Data points')
plt.ylabel('Probability Density')
Text(0, 0.5, 'Probability Density')
Exemplo: O peso médio de 500 estudantes do sexo masculino de uma determinada universidade é 75,5 Kg e o desvio padrão é 7,5 Kg. Admitindo que os pesos são normalmente distribuídos, determine a percentagem de estudantes que pesam:
a) entre 60 e 77,5 Kg.
\[
P(60 \leq X \leq 77{,}5) = P\left(\frac{60-\mu}{\sigma} \leq \frac{X-\mu}{\sigma} \leq \frac{77{,}5-\mu}{\sigma}\right) = P\left(\frac{60-\mu}{\sigma} \leq Z \leq \frac{77{,}5-\mu}{\sigma}\right)
\]
\[
= P\left(Z \leq \frac{77{,}5-\mu}{\sigma}\right) - P\left(Z \leq \frac{60-\mu}{\sigma}\right)
\]
import scipy.stats as st
media = 75.5
dp = 7.5
z1 = (60-media)/dp
z2 = (77.5-media)/dp
print('Probabilidade teórica:',st.norm.cdf(z2)-st.norm.cdf(z1))
Probabilidade teórica: 0.5857543024471563
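Alternativamente, `st.norm` aceita os parâmetros `loc` (média) e `scale` (desvio padrão), dispensando a padronização manual:

```python
import scipy.stats as st

media = 75.5
dp = 7.5
# P(60 <= X <= 77.5) diretamente na escala original
p = st.norm.cdf(77.5, loc=media, scale=dp) - st.norm.cdf(60, loc=media, scale=dp)
print('Probabilidade teórica:', p)
```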
Podemos ainda simular esse problema. Para isso, assumimos que a população segue uma normal e geramos os dados usando a função np.random.normal.
media = 75.5
dp = 7.5
n = 100
X = np.random.normal(media, dp, n)
m = 0
for x in X:
    if x > 60 and x < 77.5:
        m = m + 1
print('Probabilidade (simulação):', m/n)
Probabilidade (simulação): 0.67
3 - Teorema Central do Limite#
Teorema: Seja uma amostra aleatória \((X_1,X_2,\ldots,X_n)\) retirada de uma população com média \(\mu\) e variância \(\sigma^2\). A distribuição amostral de \(\bar{X}\) aproxima-se, para \(n\) grande, de uma distribuição normal com média \(E[\bar{X}]=\mu\) e variância \(\sigma^2/n\).
import scipy.stats as stats
vS = [1, 2 , 4 , 8, 50, 100, 1000]# tamanho da amostra
S = 500 # número de amostras (fixo)
mu = 2 # média
for n in vS:
    vmean = []
    for s in range(0,S): # Seleciona S amostras de tamanho n
        # n amostras de uma distribuição uniforme
        X = np.random.uniform(0,2*mu, n)
        # n amostras de uma distribuição exponencial
        #X = np.random.exponential(mu, n)
        vmean.append(np.mean(X))
    # mostra os resultados
    plt.figure(figsize=(6,4))
    plt.hist(x=vmean, bins='auto', color='#0504aa', alpha=0.7, rwidth=0.85, density=True)
    plt.xlabel(r'$\bar{X}$', fontsize=20)
    plt.ylabel(r'$P(\bar{X})$', fontsize=20)
    # Mostra a curva teórica
    xmin, xmax = min(vmean), max(vmean)
    lnspc = np.linspace(xmin, xmax, len(vmean))
    m, s = stats.norm.fit(vmean) # média e desvio padrão da curva ajustada
    pdf_g = stats.norm.pdf(lnspc, m, s)
    plt.plot(lnspc, pdf_g, label="Norm")
    plt.show()
Exemplo: Seja a variável aleatória com distribuição de probabilidade: P(X=3)=0,4; P(X=6)=0,3; P(X=8)=0,3. Uma amostra com 40 observações é sorteada. Qual é a probabilidade de que a média amostral seja maior do que 5?
import scipy.stats as st
import numpy as np
def esperanca(X,P):
    E = 0
    for i in range(0, len(X)):
        E = E + X[i]*P[i]
    return E
def variancia(X,P):
    E = 0; E2 = 0
    for i in range(0, len(X)):
        E = E + X[i]*P[i]
        E2 = E2 + (X[i]**2)*P[i]
    V = E2-E**2
    return V
X = [3,6,8] # valores de X
P = [0.4,0.3,0.3] # valores da probabilidade
E = esperanca(X,P)
V = variancia(X,P)
print("Esperança:", E, "Variância:",V)
mu = E
sigma = np.sqrt(V)
n = 40 # tamanho da amostra
x = 5 # valor a ser testado
Zt = (x - mu)/(sigma/np.sqrt(n))
pt = 1-st.norm.cdf(Zt)
print('Probabilidade:',pt)
Esperança: 5.4 Variância: 4.439999999999991
Probabilidade: 0.885046886863795
Podemos realizar uma simulação para verificar esse resultado. Para isso, vamos sortear várias amostras de tamanho n=40 e verificar qual é a probabilidade de a média da amostra ser maior do que 5.
import matplotlib.pyplot as plt
n = 40
ns = 1000 #numero de simulacoes
vx = [] # armazena a media amostral
for s in range(0,ns):
    A = np.random.choice(X, n, p=P)
    vx.append(np.mean(A))
plt.figure(figsize=(8,6))
plt.hist(x=vx, bins='auto',color='#0504aa',
alpha=0.7, rwidth=0.85, density = True)
plt.xlabel(r'$\bar{X}$', fontsize=20)
plt.ylabel(r'$P(\bar{X})$', fontsize=20)
plt.show()
print("Media das amostras:", np.mean(vx), "Media da população:", E)
#probabilidade de ser maior do que 5
nmaior = 0
for i in range(0, len(vx)):
    if(vx[i] > 5):
        nmaior = nmaior + 1
nmaior = nmaior/len(vx)
print("Probabilidade de ser maior do que 5:", nmaior, "Valor teórico:", pt)
Media das amostras: 5.381975000000001 Media da população: 5.4
Probabilidade de ser maior do que 5: 0.848 Valor teórico: 0.885046886863795
Teste de hipóteses#
Exemplo: Estudantes acreditam que a média da turma em um curso de estatística é igual a 65. O professor acredita que a média é maior. Para verificar essas hipóteses, ele seleciona as notas de 10 estudantes, obtendo os valores [65, 65, 70, 67, 66, 63, 63, 68, 72, 71]. Assumindo que as notas são normalmente distribuídas, calcule o valor p.
\(H_0: \mu = 65\)
\(H_1: \mu > 65\)
import numpy as np
X = [65, 65, 70, 67, 66, 63, 63, 68, 72, 71]
m = 65
n = len(X)
s = np.std(X, ddof=1)
xobs = np.mean(X)
print('s = ', s)
print('xobs = ',xobs)
s = 3.197221015541813
xobs = 67.0
talpha = (xobs - m)/(s/np.sqrt(n))
print('talpha = ', talpha)
talpha = 1.978141420187361
import scipy.stats
alpha = scipy.stats.t.cdf(talpha, n-1)
print('valor p =',1 - alpha)
valor p = 0.03964824393588806
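O mesmo teste pode ser feito em uma única chamada com `scipy.stats.ttest_1samp` (o argumento `alternative` requer scipy >= 1.6):

```python
from scipy import stats

X = [65, 65, 70, 67, 66, 63, 63, 68, 72, 71]
# H0: mu = 65 contra H1: mu > 65 (teste unilateral à direita)
res = stats.ttest_1samp(X, popmean=65, alternative='greater')
print('estatística t =', res.statistic)  # ≈ 1.9781
print('valor p =', res.pvalue)           # ≈ 0.0396
```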
Logo, o valor p é igual a 0,0396; como é menor do que 0,05, há evidência para rejeitarmos \(H_0\) ao nível de significância de 5%.
3 - TÉCNICAS AVANÇADAS - TACTD#
Conteúdo#
Web scraping
Frequência de palavras
DataFrame
O site:
https://lite.cnn.com/2024/01/09/americas/armed-men-interrupt-live-tv-ecuador-intl/index.html
corresponde a um arquivo HTML cuja maior parte do conteúdo é uma notícia sobre os recentes ataques no Equador. Vamos fazer uma requisição ao site e armazenar o conteúdo (arquivo HTML) em uma string.
import requests as rq
url = 'https://lite.cnn.com/2024/01/09/americas/armed-men-interrupt-live-tv-ecuador-intl/index.html'
pagina_head = rq.head(url)
print(pagina_head.status_code)
print('Tipo do Conteúdo:',pagina_head.headers['Content-Type'])
200
Tipo do Conteúdo: text/html; charset=utf-8
pg = rq.get(url)
texto = pg.text
print('Número de caracteres:',len(texto))
print(texto[:200])
Número de caracteres: 63427
<!DOCTYPE html>
<html lang="en" data-layout-uri="cms.cnn.com/_layouts/layout-with-rail/instances/world-article-v1@published">
<head><style>body,h1,h2,h3,h4,h5{font-family:cnn_sans_display,helvetic
Como estamos interessados somente no texto, vamos usar o BeautifulSoup para extrair o conteúdo, o qual está delimitado no HTML por
<p class="paragraph--lite">
texto texto texto...
</p>
from bs4 import BeautifulSoup
soup = BeautifulSoup(texto, 'html.parser')
texto_limpo = ""
for evento in soup('p', {'class': 'paragraph--lite'}):
    data = evento.text
    #print(data)
    texto_limpo = texto_limpo + ' ' + data
print(texto_limpo)
Ecuador’s President Daniel Noboa has declared an “internal armed conflict” in the country, ordering security forces to “neutralize” several criminal groups accused of spreading extreme violence in the Latin American nation.
The decree came shortly after hooded and armed men interrupted a live television broadcast – one of several violent incidents playing out across the country on Tuesday.
Ecuadorians were stunned as they watched the takeover of TC Television’s live broadcast from the coastal city of Guayaquil. Social media video showed the assailants forcing staff of the state-owned network onto the floor of the studio as shots and yelling were heard in the background.
Ecuador’s police later said they had arrested all the armed men, members of the media outlet had been evacuated, and all staff and hostages were alive.
At least four firearms, two grenades, and “explosive material” were recovered and 13 people apprehended, César Zapata, General Commander of the National Police said. The perpetrators would be brought to justice for their “acts of terrorism,” he added.
TC Television anchor Jorge Rendon described the takeover of the broadcast as an “extremely violent attack.”
“They wanted to enter the studio so that we could say what they wanted, I guess their message,” Rendon recalled in a video on TC Television’s official X account. Rendon said he knew of one person being shot and another injured by the assailants. Police have not confirmed those injuries.
The situation has struck fear among many Ecuadorians. One woman, who lives outside Guayaquil and was told to go home early by her boss, described the chaotic traffic on her drive home. “Cars were going the wrong way; everyone was just trying to get through,” she said.
“The scariest part was seeing the desperation, seeing businesses shutting down, desperate people, including children and women, running frantically in avenues only meant for cars.”
The country has been rocked by explosions, police kidnappings, and prison disturbances since Noboa on Monday declared a nationwide state of emergency after high-profile gang leader Adolfo “Fito” Macias escaped from a prison in Guayaquil.
Eight people were killed in Guayaquil on Tuesday, according to local police. Two police officers were also killed in the nearby city of Nobol, National Police said on X.
Meanwhile, 10 people were arrested after three kidnapped police officers were freed in the southwest city of Machala, National Police said Tuesday night. Earlier, police said at least seven officers had been taken captive in three cities since the state of emergency was announced.
At least 70 people were arrested across the whole country, police said Wednesday morning. Eight explosive devices were seized along with 15 Molotov cocktails, nine firearms, 308 firearm cartridges, six motorcycles and six vehicles.
Ecuador is “living a real nightmare,” former President Rafael Correa said in a video shared on X Tuesday. The situation was “the result of the systematic destruction of the rule of law, of the errors of hatred accumulated over the last seven years,” he claimed.
The state of emergency will last for 60 days and mobilize the police and armed forces to control disturbances to public order.
It includes a curfew, from 11 p.m. to 5 a.m., to restrict meetings and actions that may threaten public order. Noboa’s beleaguered predecessor, former President Guillermo Lasso, instated several states of emergency with limited success.
The decree signed by Noboa on Tuesday declared the country was in an “internal armed conflict” and ordered armed forces to carry out military operations to “neutralize” armed groups identified as terrorists.
Adm. Jaime Vela Erazo, head of the Joint Command of Ecuador’s Armed Forces, on Tuesday vowed not to “back down or negotiate” with armed groups, adding the “future of our country is at stake.”
“From this moment on, every terrorist group identified in the aforementioned [emergency] decree has become a military target,” he said.
The spiraling violence is the most extreme test yet for the new president, who won last year’s run-off vote with promises to tackle soaring crime.
Ecuador’s worsening security situation is largely driven by rival criminal organizations, which have been meting out brutal and often public shows of violence in the country’s streets and prisons in their battle to control drug trafficking routes.
In one of the kidnappings this week, in which three agents were taken, an explosive device had been “placed and detonated” in a vehicle the officers were moving in, police said.
In the northwestern city of Esmeraldas, two vehicles were set on fire with one causing a blaze at a gas station.
In Guayaquil, one hospital said security guards had stopped armed individuals from entering the facility, denying reports that some health personnel had been kidnapped. The military is now guarding the hospital, it said.
And in the capital Quito, police found a burned vehicle with traces of gas cylinders inside. Residents reported on social media they had heard a loud explosion in the area.
Police also said they had received reports of an explosion at a pedestrian bridge outside Quito and attended “over 20 emergencies during (Monday) evening and overnight (Tuesday) in different parts of the country.” No known casualties related to the explosions were immediately reported.
Amid the unrest in Ecuador, countries across the region, including neighboring Colombia and Peru, expressed concern over the situation and support for Noboa’s government to restore order.
Officials in Peru said the country plans to declare an emergency along its entire northern border with Ecuador. Peru’s interior minister has also ordered National Police to reinforce security on the border, the interior ministry said.
In a statement on X, a US State Department official said the United States stands with the people of Ecuador and is “ready to provide assistance to the Ecuadorian government.”
Ecuador’s penitentiary service, the SNAI, said at least six incidents took place inside prisons Monday, including disturbances and retention of prison guards. The situation in prisons was not under control, it said.
Meanwhile, another alleged gang leader, Fabricio Colon Pico, escaped from a prison in the central city of Riobamba, according to its mayor John Vinueza.
Colon Pico had been captured last Friday after being publicly identified by Attorney General Diana Salazar as being part of a plan to attack her. Along with Colon Pico, 38 other inmates escaped, of which 12 have been recaptured, the SNAI told CNN.
Ecuador’s Armed Forces said they carried out control operations Monday night and early Tuesday in the most conflict-ridden areas.
On the political side, Ecuador’s National Assembly is holding an emergency meeting to “generate concrete actions in face of the national commotion and multiple acts that threaten public peace.”
Speaking to Radio Canela on Wednesday, Noboa said the prison officials who were on duty when Fito escaped will be prosecuted, warning his country is in a “state of war” against “terrorist groups.”
The search for Adolfo Macias, more popularly known by his alias “Fito,” continued as more than 3,000 police officers and members of the armed forces have been deployed to find him, the government said Sunday. Ecuador authorities said they have not yet pinpointed the exact time and date that Macias escaped prison.
Macías is the leader of Los Choneros, one of Ecuador’s most feared gangs, which has been linked to maritime drug trafficking to Mexico and the United States, working with with Mexico’s Sinaloa cartel and the Oliver Sinisterra Front in Colombia, according to the Insight Crime research center.
He was jailed after being convicted of drug trafficking. Before his assassination, the late Ecuadorian presidential candidate Fernando Villavicencio said in July that he had been threatened by Macías and warned against continuing with his campaign against gang violence for the leadership.
This story has been updated with additional developments.
See Full Web Article
Let us now build, from the string stored above, a list of the words that appear in the HTML document. The words in the list must consist only of alphabetic characters, be longer than 1 character, and not be 'stop words'.
We will use the nltk library to build the word list.
import nltk
from nltk.corpus import stopwords
from collections import Counter
nltk.download('stopwords')
stop_words = stopwords.words('english')
print(len(stop_words))
179
[nltk_data] Downloading package stopwords to
[nltk_data] /Users/gnonato/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
words = nltk.word_tokenize(texto_limpo)
words = [w.lower() for w in words if w.isalpha() and len(w) > 1]
words = [w for w in words if w not in stop_words]
print(words[:100])
['ecuador', 'president', 'daniel', 'noboa', 'declared', 'internal', 'armed', 'conflict', 'country', 'ordering', 'security', 'forces', 'neutralize', 'several', 'criminal', 'groups', 'accused', 'spreading', 'extreme', 'violence', 'latin', 'american', 'nation', 'decree', 'came', 'shortly', 'hooded', 'armed', 'men', 'interrupted', 'live', 'television', 'broadcast', 'one', 'several', 'violent', 'incidents', 'playing', 'across', 'country', 'tuesday', 'ecuadorians', 'stunned', 'watched', 'takeover', 'tc', 'television', 'live', 'broadcast', 'coastal', 'city', 'guayaquil', 'social', 'media', 'video', 'showed', 'assailants', 'forcing', 'staff', 'network', 'onto', 'floor', 'studio', 'shots', 'yelling', 'heard', 'background', 'ecuador', 'police', 'later', 'said', 'arrested', 'armed', 'men', 'members', 'media', 'outlet', 'evacuated', 'staff', 'hostages', 'alive', 'least', 'four', 'firearms', 'two', 'grenades', 'explosive', 'material', 'recovered', 'people', 'apprehended', 'césar', 'zapata', 'general', 'commander', 'national', 'police', 'said', 'perpetrators', 'would']
Let us now perform lexical normalization (stemming) on the word list.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()  # instantiate once instead of once per word
words = [stemmer.stem(w) for w in words]
print(words[:100])
['ecuador', 'presid', 'daniel', 'noboa', 'declar', 'intern', 'arm', 'conflict', 'countri', 'order', 'secur', 'forc', 'neutral', 'sever', 'crimin', 'group', 'accu', 'spread', 'extrem', 'violenc', 'latin', 'american', 'nation', 'decr', 'came', 'shortli', 'hood', 'arm', 'men', 'interrupt', 'live', 'televi', 'broadcast', 'one', 'sever', 'violent', 'incid', 'play', 'across', 'countri', 'tuesday', 'ecuadorian', 'stun', 'watch', 'takeov', 'tc', 'televi', 'live', 'broadcast', 'coastal', 'citi', 'guayaquil', 'social', 'media', 'video', 'show', 'assail', 'forc', 'staff', 'network', 'onto', 'floor', 'studio', 'shot', 'yell', 'heard', 'background', 'ecuador', 'polic', 'later', 'said', 'arrest', 'arm', 'men', 'member', 'media', 'outlet', 'evacu', 'staff', 'hostag', 'aliv', 'least', 'four', 'firearm', 'two', 'grenad', 'explo', 'materi', 'recov', 'peopl', 'apprehend', 'césar', 'zapata', 'gener', 'command', 'nation', 'polic', 'said', 'perpetr', 'would']
Let us compute the frequency of each word and build a dictionary sorted by frequency, from highest to lowest.
d = dict(Counter(words))
print(list(d.items())[:10])
[('ecuador', 13), ('presid', 4), ('daniel', 1), ('noboa', 6), ('declar', 4), ('intern', 2), ('arm', 12), ('conflict', 2), ('countri', 10), ('order', 6)]
dsorted = sorted(d.items(), key=lambda x: x[1], reverse=True)
#print(type(dsorted))
for p,f in dsorted:
print(p,'-->',f)
said --> 24
polic --> 17
ecuador --> 13
arm --> 12
countri --> 10
prison --> 9
tuesday --> 8
state --> 8
emerg --> 8
forc --> 7
nation --> 7
one --> 7
explos --> 7
noboa --> 6
order --> 6
citi --> 6
peopl --> 6
guayaquil --> 5
situat --> 5
escap --> 5
offic --> 5
presid --> 4
declar --> 4
secur --> 4
group --> 4
violenc --> 4
live --> 4
televis --> 4
ecuadorian --> 4
least --> 4
offici --> 4
includ --> 4
kidnap --> 4
monday --> 4
gang --> 4
vehicl --> 4
last --> 4
control --> 4
public --> 4
report --> 4
sever --> 3
extrem --> 3
decre --> 3
broadcast --> 3
across --> 3
tc --> 3
media --> 3
video --> 3
arrest --> 3
firearm --> 3
two --> 3
gener --> 3
rendon --> 3
part --> 3
see --> 3
disturb --> 3
leader --> 3
fito --> 3
macia --> 3
accord --> 3
also --> 3
three --> 3
along --> 3
six --> 3
threaten --> 3
militari --> 3
identifi --> 3
terrorist --> 3
drug --> 3
traffick --> 3
guard --> 3
peru --> 3
colon --> 3
pico --> 3
intern --> 2
conflict --> 2
neutral --> 2
crimin --> 2
men --> 2
violent --> 2
incid --> 2
takeov --> 2
social --> 2
show --> 2
assail --> 2
staff --> 2
studio --> 2
shot --> 2
heard --> 2
member --> 2
command --> 2
act --> 2
ad --> 2
describ --> 2
want --> 2
enter --> 2
anoth --> 2
fear --> 2
outsid --> 2
told --> 2
go --> 2
home --> 2
earli --> 2
desper --> 2
sinc --> 2
adolfo --> 2
eight --> 2
kill --> 2
meanwhil --> 2
night --> 2
seven --> 2
taken --> 2
wednesday --> 2
devic --> 2
former --> 2
year --> 2
meet --> 2
action --> 2
carri --> 2
oper --> 2
yet --> 2
crime --> 2
place --> 2
ga --> 2
hospit --> 2
quito --> 2
insid --> 2
area --> 2
known --> 2
colombia --> 2
govern --> 2
plan --> 2
border --> 2
interior --> 2
unit --> 2
snai --> 2
warn --> 2
continu --> 2
macía --> 2
mexico --> 2
daniel --> 1
accus --> 1
spread --> 1
latin --> 1
american --> 1
came --> 1
shortli --> 1
hood --> 1
interrupt --> 1
play --> 1
stun --> 1
watch --> 1
coastal --> 1
network --> 1
onto --> 1
floor --> 1
yell --> 1
background --> 1
later --> 1
outlet --> 1
evacu --> 1
hostag --> 1
aliv --> 1
four --> 1
grenad --> 1
materi --> 1
recov --> 1
apprehend --> 1
césar --> 1
zapata --> 1
perpetr --> 1
would --> 1
brought --> 1
justic --> 1
terror --> 1
anchor --> 1
jorg --> 1
could --> 1
say --> 1
guess --> 1
messag --> 1
recal --> 1
account --> 1
knew --> 1
person --> 1
injur --> 1
confirm --> 1
injuri --> 1
struck --> 1
among --> 1
mani --> 1
woman --> 1
boss --> 1
chaotic --> 1
traffic --> 1
drive --> 1
car --> 1
wrong --> 1
way --> 1
everyon --> 1
tri --> 1
get --> 1
scariest --> 1
busi --> 1
shut --> 1
children --> 1
women --> 1
run --> 1
frantic --> 1
avenu --> 1
meant --> 1
rock --> 1
nationwid --> 1
local --> 1
nearbi --> 1
nobol --> 1
freed --> 1
southwest --> 1
machala --> 1
earlier --> 1
captiv --> 1
announc --> 1
whole --> 1
morn --> 1
seiz --> 1
molotov --> 1
cocktail --> 1
nine --> 1
cartridg --> 1
motorcycl --> 1
real --> 1
nightmar --> 1
rafael --> 1
correa --> 1
share --> 1
result --> 1
systemat --> 1
destruct --> 1
rule --> 1
law --> 1
error --> 1
hatr --> 1
accumul --> 1
claim --> 1
day --> 1
mobil --> 1
curfew --> 1
restrict --> 1
may --> 1
beleagu --> 1
predecessor --> 1
guillermo --> 1
lasso --> 1
instat --> 1
limit --> 1
success --> 1
sign --> 1
jaim --> 1
vela --> 1
erazo --> 1
head --> 1
joint --> 1
vow --> 1
back --> 1
negoti --> 1
futur --> 1
moment --> 1
everi --> 1
aforement --> 1
becom --> 1
target --> 1
spiral --> 1
test --> 1
new --> 1
vote --> 1
promis --> 1
tackl --> 1
soar --> 1
worsen --> 1
larg --> 1
driven --> 1
rival --> 1
organ --> 1
mete --> 1
brutal --> 1
often --> 1
street --> 1
battl --> 1
rout --> 1
week --> 1
agent --> 1
deton --> 1
move --> 1
northwestern --> 1
esmeralda --> 1
set --> 1
fire --> 1
caus --> 1
blaze --> 1
station --> 1
stop --> 1
individu --> 1
facil --> 1
deni --> 1
health --> 1
personnel --> 1
capit --> 1
found --> 1
burn --> 1
trace --> 1
cylind --> 1
resid --> 1
loud --> 1
receiv --> 1
pedestrian --> 1
bridg --> 1
attend --> 1
even --> 1
overnight --> 1
differ --> 1
casualti --> 1
relat --> 1
immedi --> 1
amid --> 1
unrest --> 1
region --> 1
neighbor --> 1
express --> 1
concern --> 1
support --> 1
restor --> 1
entir --> 1
northern --> 1
minist --> 1
reinforc --> 1
ministri --> 1
statement --> 1
us --> 1
depart --> 1
stand --> 1
readi --> 1
provid --> 1
assist --> 1
penitentiari --> 1
servic --> 1
took --> 1
retent --> 1
alleg --> 1
fabricio --> 1
central --> 1
riobamba --> 1
mayor --> 1
john --> 1
vinueza --> 1
captur --> 1
friday --> 1
publicli --> 1
attorney --> 1
diana --> 1
salazar --> 1
attack --> 1
inmat --> 1
recaptur --> 1
cnn --> 1
polit --> 1
side --> 1
assembl --> 1
hold --> 1
concret --> 1
face --> 1
commot --> 1
multipl --> 1
speak --> 1
radio --> 1
canela --> 1
duti --> 1
prosecut --> 1
war --> 1
search --> 1
popularli --> 1
alia --> 1
deploy --> 1
find --> 1
sunday --> 1
author --> 1
pinpoint --> 1
exact --> 1
time --> 1
date --> 1
lo --> 1
chonero --> 1
link --> 1
maritim --> 1
work --> 1
sinaloa --> 1
cartel --> 1
oliv --> 1
sinisterra --> 1
front --> 1
insight --> 1
research --> 1
center --> 1
jail --> 1
convict --> 1
assassin --> 1
late --> 1
presidenti --> 1
candid --> 1
fernando --> 1
villavicencio --> 1
juli --> 1
campaign --> 1
leadership --> 1
stori --> 1
updat --> 1
addit --> 1
develop --> 1
full --> 1
web --> 1
articl --> 1
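As a side note, the explicit `sorted(...)` call above can be replaced by `Counter.most_common`, which returns the (word, frequency) pairs already ordered from the most to the least frequent. A minimal sketch with toy words (the list below is made up for illustration, not the article's actual words):

```python
from collections import Counter

# toy word list standing in for the stemmed `words` list built above
toy_words = ["polic", "said", "polic", "ecuador", "said", "said"]

# most_common() returns (word, count) pairs sorted by descending count
freqs = Counter(toy_words).most_common()
print(freqs)  # [('said', 3), ('polic', 2), ('ecuador', 1)]
```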
Finally, let us save the list of words and their respective frequencies to a file in .csv format.
import pandas as pd
df = pd.DataFrame(dsorted, columns = ['Word', 'Freq'])
print(df.head())
Word Freq
0 said 24
1 polic 17
2 ecuador 13
3 arm 12
4 countri 10
df.to_csv('word_freq.csv',index=False)
4 - DW#
#FILL IN YOUR FULL NAME HERE:
MBA in Data Science#
Neural Networks and Deep Architectures#
Data Analysis Based on Massively Parallel Processing#
Final Exam#
Material produced by:
Profa. Dra. Cristina Dutra de Aguiar
Prof. Dr. Moacir A. Ponti
CEMEAI - ICMC/USP São Carlos
The final exam contains 1 question, divided into 3 items. Please search for "Questão" to find the statement of the question and for "RESOLVER" to find the statement of each item to be solved. The question and its items can also be located through the navigation menu.
The notebook contains the BI Solutions fact constellation that must be used to answer the question, as well as all the libraries, databases, initializations, installations, imports, DataFrame generation, temporary-view generation, and data-type conversions needed to carry out the question.
INSTRUCTIONS:
You must export this notebook with your solution to the exam questions in .py format and upload it to Moodle. Attention: do not upload a notebook file (.ipynb); upload a .py text file containing the Python code you used to solve the questions. The .py file can be generated through the option:
File –> Download as –> Python (.py) available in Jupyter Notebook,
or File –> Download .py in Google Colab.
If you are not using Jupyter, copy and paste your code into an ASCII (text) file and save it with the .py extension.
You must also save this notebook with your solution to the exam questions in .pdf format and upload it to Moodle.
The files must be named with your first and last name, without spaces. Example: moacirponti.py and moacirponti.pdf
The header (beginning) of the file MUST contain a comment / text with your full name.
We wish you a good exam!
#1 BI Solutions Fact Constellation
The BI Solutions data warehousing application is based on a fact constellation, described below.
Dimension tables
data (dataPK, dataCompleta, dataDia, dataMes, dataBimestre, dataTrimestre, dataSemestre, dataAno)
funcionario (funcPK, funcMatricula, funcNome, funcSexo, funcDataNascimento, funcDiaNascimento, funcMesNascimento, funcAnoNascimento, funcCidade, funcEstadoNome, funcEstadoSigla, funcRegiaoNome, funcRegiaoSigla, funcPaisNome, funcPaisSigla)
equipe (equipePK, equipeNome, filialNome, filialCidade, filialEstadoNome, filialEstadoSigla, filialRegiaoNome, filialRegiaoSigla, filialPaisNome, filialPaisSigla)
cargo (cargoPK, cargoNome, cargoRegimeTrabalho, cargoEscolaridadeMinima, cargoNivel)
cliente (clientePK, clienteNomeFantasia, clienteSetor, clienteCidade, clienteEstadoNome, clienteEstadoSigla, clienteRegiaoNome, clienteRegiaoSigla, clientePaisNome, clientePaisSigla)
Fact tables
pagamento (dataPK, funcPK, equipePK, cargoPK, salario, quantidadeLancamentos)
negociacao (dataPK, equipePK, clientePK, receita, quantidadeNegociacoes)
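The two fact tables share the data and equipe dimensions (conformed dimensions), which is what makes the schema a fact constellation rather than a single star. As an illustration only, the star-style join of negociacao against its dimensions can be sketched in pandas with tiny synthetic rows (the values below are made up and are not the real BI Solutions data):

```python
import pandas as pd

# synthetic miniatures of the dimension and fact tables
data = pd.DataFrame({"dataPK": [1, 2], "dataAno": [2019, 2020]})
equipe = pd.DataFrame({"equipePK": [10], "equipeNome": ["WEB"],
                       "filialNome": ["SAO PAULO - AV. PAULISTA"]})
negociacao = pd.DataFrame({"dataPK": [1, 2, 2], "equipePK": [10, 10, 10],
                           "receita": [100.0, 200.0, 300.0],
                           "quantidadeNegociacoes": [1, 2, 1]})

# join the fact table to its dimensions via the surrogate keys,
# then filter and aggregate as a typical OLAP query would
df = (negociacao.merge(equipe, on="equipePK")
                .merge(data, on="dataPK")
                .query("dataAno == 2020")
                .groupby(["equipeNome", "filialNome"], as_index=False)["receita"]
                .sum())
print(df)  # one row: WEB / SAO PAULO - AV. PAULISTA, receita 500.0
```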
#2 Configuration
2.1 Obtaining the BI Solutions Data#
%%capture
#installing the wget module (%%capture must be the first line of the cell)
!pip install -q wget
!mkdir data
#downloading the data for the dimension tables and the fact tables
import wget
url = "https://raw.githubusercontent.com/cristinaaguiar/DataMartBISolutions/main/data.csv"
wget.download(url, "data/data.csv")
url = "https://raw.githubusercontent.com/cristinaaguiar/DataMartBISolutions/main/funcionario.csv"
wget.download(url, "data/funcionario.csv")
url = "https://raw.githubusercontent.com/cristinaaguiar/DataMartBISolutions/main/equipe.csv"
wget.download(url, "data/equipe.csv")
url = "https://raw.githubusercontent.com/cristinaaguiar/DataMartBISolutions/main/cargo.csv"
wget.download(url, "data/cargo.csv")
url = "https://raw.githubusercontent.com/cristinaaguiar/DataMartBISolutions/main/cliente.csv"
wget.download(url, "data/cliente.csv")
url = "https://raw.githubusercontent.com/cristinaaguiar/DataMartBISolutions/main/pagamento.csv"
wget.download(url, "data/pagamento.csv")
url = "https://raw.githubusercontent.com/cristinaaguiar/DataMartBISolutions/main/negociacao.csv"
wget.download(url, "data/negociacao.csv")
2.2 Installations and Initializations#
%%capture
#installing Java Runtime Environment (JRE) version 8
!apt-get remove openjdk*
!apt-get update --fix-missing
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
%%capture
#downloading Apache Spark version 3.0.0
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz
!tar xf spark-3.0.0-bin-hadoop2.7.tgz && rm spark-3.0.0-bin-hadoop2.7.tgz
import os
#setting the JAVA_HOME environment variable
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
#setting the SPARK_HOME environment variable
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop2.7"
%%capture
#installing the findspark package
!pip install -q findspark==1.4.2
#installing the pyspark package
!pip install -q pyspark==3.0.0
2.3 Libraries#
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("pyspark-notebook").master("local[*]").getOrCreate()
from pyspark.sql.types import IntegerType
from pyspark.sql.types import FloatType
from pyspark.sql.functions import round, desc
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from numpy.random import seed
from tensorflow.random import set_seed
from tensorflow import keras
from tensorflow.keras import layers
2.4 Generating the BI Solutions DataFrames in Pandas#
This section generates the Pandas DataFrames. Pay attention to the names of these DataFrames.
pd.set_option('display.float_format', lambda x: '%.2f' % x)
cargoPandas = pd.read_csv('https://raw.githubusercontent.com/cristinaaguiar/DataMartBISolutions/main/cargo.csv')
clientePandas = pd.read_csv('https://raw.githubusercontent.com/cristinaaguiar/DataMartBISolutions/main/cliente.csv')
dataPandas = pd.read_csv('https://raw.githubusercontent.com/cristinaaguiar/DataMartBISolutions/main/data.csv')
equipePandas = pd.read_csv('https://raw.githubusercontent.com/cristinaaguiar/DataMartBISolutions/main/equipe.csv')
funcionarioPandas = pd.read_csv('https://raw.githubusercontent.com/cristinaaguiar/DataMartBISolutions/main/funcionario.csv')
negociacaoPandas = pd.read_csv('https://raw.githubusercontent.com/cristinaaguiar/DataMartBISolutions/main/negociacao.csv')
pagamentoPandas = pd.read_csv('https://raw.githubusercontent.com/cristinaaguiar/DataMartBISolutions/main/pagamento.csv')
2.5 Generating the BI Solutions DataFrames in Spark#
This section generates the Spark DataFrames. Pay attention to the names of these DataFrames.
#creating the Spark DataFrames
cargo = spark.read.csv(path="data/cargo.csv", header=True, sep=",")
cliente = spark.read.csv(path="data/cliente.csv", header=True, sep=",")
data = spark.read.csv(path="data/data.csv", header=True, sep=",")
equipe = spark.read.csv(path="data/equipe.csv", header=True, sep=",")
funcionario = spark.read.csv(path="data/funcionario.csv", header=True, sep=",")
negociacao = spark.read.csv(path="data/negociacao.csv", header=True, sep=",")
pagamento = spark.read.csv(path="data/pagamento.csv", header=True, sep=",")
#casting the required columns to the integer data type
colunas_cargo = ["cargoPK"]
colunas_cliente = ["clientePK"]
colunas_data = ["dataPK", "dataDia", "dataMes", "dataBimestre", "dataTrimestre", "dataSemestre", "dataAno"]
colunas_equipe = ["equipePK"]
colunas_funcionario = ["funcPK", "funcDiaNascimento", "funcMesNascimento", "funcAnoNascimento"]
colunas_negociacao = ["equipePK", "clientePK", "dataPK", "quantidadeNegociacoes"]
colunas_pagamento = ["funcPK", "equipePK", "dataPK", "cargoPK", "quantidadeLancamentos"]
for coluna in colunas_cargo:
cargo = cargo.withColumn(coluna, cargo[coluna].cast(IntegerType()))
for coluna in colunas_cliente:
cliente = cliente.withColumn(coluna, cliente[coluna].cast(IntegerType()))
for coluna in colunas_data:
data = data.withColumn(coluna, data[coluna].cast(IntegerType()))
for coluna in colunas_equipe:
equipe = equipe.withColumn(coluna, equipe[coluna].cast(IntegerType()))
for coluna in colunas_funcionario:
funcionario = funcionario.withColumn(coluna, funcionario[coluna].cast(IntegerType()))
for coluna in colunas_negociacao:
negociacao = negociacao.withColumn(coluna, negociacao[coluna].cast(IntegerType()))
for coluna in colunas_pagamento:
pagamento = pagamento.withColumn(coluna, pagamento[coluna].cast(IntegerType()))
#casting the required columns to the float data type
colunas_negociacao = ["receita"]
colunas_pagamento = ["salario"]
for coluna in colunas_negociacao:
negociacao = negociacao.withColumn(coluna, negociacao[coluna].cast(FloatType()))
for coluna in colunas_pagamento:
pagamento = pagamento.withColumn(coluna, pagamento[coluna].cast(FloatType()))
#creating the temporary views
cargo.createOrReplaceTempView("cargo")
cliente.createOrReplaceTempView("cliente")
data.createOrReplaceTempView("data")
equipe.createOrReplaceTempView("equipe")
funcionario.createOrReplaceTempView("funcionario")
negociacao.createOrReplaceTempView("negociacao")
pagamento.createOrReplaceTempView("pagamento")
3 Example OLAP Query#
For each team name and team branch name, list the sum of the revenue received in the year 2020.
The columns must be displayed in the order and with the names specified below: “NOMEEQUIPE”, “NOMEFILIAL”, “TOTALRECEITA”. Sort the displayed rows first by team name and then by branch name, both in ascending order. List the first 25 rows of the answer, without truncating the strings.
# Answer to the OLAP query using the SQL language
query = """
SELECT equipeNome AS NOMEEQUIPE,
filialNome AS NOMEFILIAL,
ROUND(SUM(receita), 2) AS TOTALRECEITA
FROM negociacao JOIN equipe ON equipe.equipePK = negociacao.equipePK
JOIN data ON data.dataPK = negociacao.dataPK
WHERE dataAno = 2020
GROUP BY equipeNome, filialNome
ORDER BY equipeNome, filialNome
"""
spark.sql(query).show(25,truncate=False)
+--------------+--------------------------------+-------------+
|NOMEEQUIPE |NOMEFILIAL |TOTALRECEITA |
+--------------+--------------------------------+-------------+
|APP - DESKTOP |RIO DE JANEIRO - BARRA DA TIJUCA|2290441.3 |
|APP - DESKTOP |SAO PAULO - AV. PAULISTA |2146181.24 |
|APP - MOBILE |CAMPO GRANDE - CENTRO |1347929.7 |
|APP - MOBILE |RIO DE JANEIRO - BARRA DA TIJUCA|1289305.0 |
|APP - MOBILE |SAO PAULO - AV. PAULISTA |1243670.55 |
|BI & ANALYTICS|RECIFE - CENTRO |8791572.87 |
|BI & ANALYTICS|SAO PAULO - AV. PAULISTA |1.073077352E7|
|WEB |CAMPO GRANDE - CENTRO |1017644.06 |
|WEB |RIO DE JANEIRO - BARRA DA TIJUCA|612673.9 |
|WEB |SAO PAULO - AV. PAULISTA |751983.74 |
+--------------+--------------------------------+-------------+
# Answer to the OLAP query using Pandas
mergeNegEq = negociacaoPandas.merge(equipePandas, on="equipePK")
mergeNegEqData = mergeNegEq.merge(dataPandas, on="dataPK")
mergeNegEqDataFiltrado = mergeNegEqData.query('dataAno == 2020')
df = mergeNegEqDataFiltrado.groupby(["equipeNome", "filialNome"], as_index=False)["receita"].sum().round(2)
df = df.sort_values(by=["equipeNome", "filialNome"], ascending=True)
df = df.rename(columns={"equipeNome":"NOMEEQUIPE", "filialNome": "NOMEFILIAL", "receita":"TOTALRECEITA"})
display(df.head(25))
| NOMEEQUIPE | NOMEFILIAL | TOTALRECEITA | |
|---|---|---|---|
| 0 | APP - DESKTOP | RIO DE JANEIRO - BARRA DA TIJUCA | 2290441.30 |
| 1 | APP - DESKTOP | SAO PAULO - AV. PAULISTA | 2146181.25 |
| 2 | APP - MOBILE | CAMPO GRANDE - CENTRO | 1347929.70 |
| 3 | APP - MOBILE | RIO DE JANEIRO - BARRA DA TIJUCA | 1289305.00 |
| 4 | APP - MOBILE | SAO PAULO - AV. PAULISTA | 1243670.55 |
| 5 | BI & ANALYTICS | RECIFE - CENTRO | 8791572.90 |
| 6 | BI & ANALYTICS | SAO PAULO - AV. PAULISTA | 10730773.55 |
| 7 | WEB | CAMPO GRANDE - CENTRO | 1017644.05 |
| 8 | WEB | RIO DE JANEIRO - BARRA DA TIJUCA | 612673.90 |
| 9 | WEB | SAO PAULO - AV. PAULISTA | 751983.75 |
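Note the small discrepancies between the Spark totals (e.g. 2146181.24) and the Pandas totals above (e.g. 2146181.25): receita was cast to FloatType (32-bit) in the Spark DataFrame, while pandas reads the column as float64, and summing values in the millions in 32-bit precision loses the low-order digits. A minimal NumPy sketch of the effect (the values below are arbitrary, not the BI Solutions data):

```python
import numpy as np

# the same column of monetary values summed in 32-bit vs 64-bit precision
vals = np.full(1000, 1234567.89)

sum64 = vals.astype(np.float64).sum()  # pandas default: float64
sum32 = vals.astype(np.float32).sum()  # Spark FloatType: float32

# float32 keeps only ~7 significant digits, so the cents (and more) drift
print(sum64, float(sum32), abs(sum64 - float(sum32)))
```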
# Answer to the OLAP query using pyspark
dataFramepyspark = negociacao\
.join(equipe, on='equipePK')\
.join(data, on='dataPK')\
.where('dataAno = 2020')\
.groupBy('equipeNome', 'filialNome')\
.sum('receita')\
.withColumn('sum(receita)', round('sum(receita)', 2))\
.withColumnRenamed('equipeNome', 'NOMEEQUIPE')\
.withColumnRenamed('filialNome', 'NOMEFILIAL')\
.withColumnRenamed('sum(receita)', 'TOTALRECEITA')\
.orderBy('NOMEEQUIPE', 'NOMEFILIAL')
dataFramepyspark.show(25,truncate=False)
+--------------+--------------------------------+-------------+
|NOMEEQUIPE |NOMEFILIAL |TOTALRECEITA |
+--------------+--------------------------------+-------------+
|APP - DESKTOP |RIO DE JANEIRO - BARRA DA TIJUCA|2290441.3 |
|APP - DESKTOP |SAO PAULO - AV. PAULISTA |2146181.24 |
|APP - MOBILE |CAMPO GRANDE - CENTRO |1347929.7 |
|APP - MOBILE |RIO DE JANEIRO - BARRA DA TIJUCA|1289305.0 |
|APP - MOBILE |SAO PAULO - AV. PAULISTA |1243670.55 |
|BI & ANALYTICS|RECIFE - CENTRO |8791572.87 |
|BI & ANALYTICS|SAO PAULO - AV. PAULISTA |1.073077352E7|
|WEB |CAMPO GRANDE - CENTRO |1017644.06 |
|WEB |RIO DE JANEIRO - BARRA DA TIJUCA|612673.9 |
|WEB |SAO PAULO - AV. PAULISTA |751983.74 |
+--------------+--------------------------------+-------------+
# If your query was written in the SQL language,
# convert the query result into a Pandas DataFrame
# by uncommenting the following command:
#df = spark.sql(query).toPandas()
# If your query was written using pyspark commands,
# convert the query result into a Pandas DataFrame
# by uncommenting the following command:
df = dataFramepyspark.toPandas()
# Displaying a few rows of the generated DataFrame
df
| NOMEEQUIPE | NOMEFILIAL | TOTALRECEITA | |
|---|---|---|---|
| 0 | APP - DESKTOP | RIO DE JANEIRO - BARRA DA TIJUCA | 2290441.30 |
| 1 | APP - DESKTOP | SAO PAULO - AV. PAULISTA | 2146181.24 |
| 2 | APP - MOBILE | CAMPO GRANDE - CENTRO | 1347929.70 |
| 3 | APP - MOBILE | RIO DE JANEIRO - BARRA DA TIJUCA | 1289305.00 |
| 4 | APP - MOBILE | SAO PAULO - AV. PAULISTA | 1243670.55 |
| 5 | BI & ANALYTICS | RECIFE - CENTRO | 8791572.87 |
| 6 | BI & ANALYTICS | SAO PAULO - AV. PAULISTA | 10730773.52 |
| 7 | WEB | CAMPO GRANDE - CENTRO | 1017644.06 |
| 8 | WEB | RIO DE JANEIRO - BARRA DA TIJUCA | 612673.90 |
| 9 | WEB | SAO PAULO - AV. PAULISTA | 751983.74 |
5 - NEURAL NETWORKS - RNAP#
Review#
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tensorflow import keras
import tensorflow as tf
from sklearn.model_selection import train_test_split
df = pd.read_csv("smartphone_activity_dataset.csv")
print(df.shape)
df.head(8)
2024-01-08 21:01:19.298659: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-01-08 21:01:19.712673: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-01-08 21:01:19.715225: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-08 21:01:22.103870: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
(10299, 562)
| feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 | feature_10 | ... | feature_553 | feature_554 | feature_555 | feature_556 | feature_557 | feature_558 | feature_559 | feature_560 | feature_561 | activity | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.289 | -0.0203 | -0.133 | -0.995 | -0.983 | -0.914 | -0.995 | -0.983 | -0.924 | -0.935 | ... | -0.2990 | -0.710 | -0.1130 | 0.03040 | -0.4650 | -0.0184 | -0.841 | 0.180 | -0.0586 | 5 |
| 1 | 0.278 | -0.0164 | -0.124 | -0.998 | -0.975 | -0.960 | -0.999 | -0.975 | -0.958 | -0.943 | ... | -0.5950 | -0.861 | 0.0535 | -0.00743 | -0.7330 | 0.7040 | -0.845 | 0.180 | -0.0543 | 5 |
| 2 | 0.280 | -0.0195 | -0.113 | -0.995 | -0.967 | -0.979 | -0.997 | -0.964 | -0.977 | -0.939 | ... | -0.3910 | -0.760 | -0.1190 | 0.17800 | 0.1010 | 0.8090 | -0.849 | 0.181 | -0.0491 | 5 |
| 3 | 0.279 | -0.0262 | -0.123 | -0.996 | -0.983 | -0.991 | -0.997 | -0.983 | -0.989 | -0.939 | ... | -0.1170 | -0.483 | -0.0368 | -0.01290 | 0.6400 | -0.4850 | -0.849 | 0.182 | -0.0477 | 5 |
| 4 | 0.277 | -0.0166 | -0.115 | -0.998 | -0.981 | -0.990 | -0.998 | -0.980 | -0.990 | -0.942 | ... | -0.3510 | -0.699 | 0.1230 | 0.12300 | 0.6940 | -0.6160 | -0.848 | 0.185 | -0.0439 | 5 |
| 5 | 0.277 | -0.0101 | -0.105 | -0.997 | -0.990 | -0.995 | -0.998 | -0.990 | -0.996 | -0.942 | ... | -0.5450 | -0.845 | 0.0826 | -0.14300 | 0.2750 | -0.3680 | -0.850 | 0.185 | -0.0421 | 5 |
| 6 | 0.279 | -0.0196 | -0.110 | -0.997 | -0.967 | -0.983 | -0.997 | -0.966 | -0.983 | -0.941 | ... | -0.2170 | -0.564 | -0.2130 | -0.23100 | 0.0146 | -0.1900 | -0.852 | 0.182 | -0.0430 | 5 |
| 7 | 0.277 | -0.0305 | -0.125 | -0.997 | -0.967 | -0.982 | -0.996 | -0.966 | -0.983 | -0.941 | ... | -0.0823 | -0.422 | -0.0209 | 0.59400 | -0.5620 | 0.4670 | -0.851 | 0.184 | -0.0420 | 5 |
8 rows × 562 columns
Problem: sensor data as input, activities as target#
df['activity'].value_counts().plot(kind='bar')
<Axes: xlabel='activity'>
df['activity'].value_counts()
activity
6 1944
5 1906
4 1777
1 1722
2 1544
3 1406
Name: count, dtype: int64
1) Preparing the data and building the training and test sets#
Convert the DataFrame to a NumPy array, then separate the features (input) from the targets (output)
# number of duplicated feature rows
sum(df.iloc[:, :-1].duplicated())
0
nparray = df.to_numpy()
nparray
array([[ 0.289 , -0.0203, -0.133 , ..., 0.18 , -0.0586, 5. ],
[ 0.278 , -0.0164, -0.124 , ..., 0.18 , -0.0543, 5. ],
[ 0.28 , -0.0195, -0.113 , ..., 0.181 , -0.0491, 5. ],
...,
[ 0.35 , 0.0301, -0.116 , ..., 0.274 , 0.181 , 2. ],
[ 0.238 , 0.0185, -0.0965, ..., 0.265 , 0.188 , 2. ],
[ 0.154 , -0.0184, -0.137 , ..., 0.264 , 0.188 , 2. ]])
features = (nparray[:,:-1]).astype(float)
targets = (nparray[:,-1]).astype(int)
print("Features = ", features.shape)
print("Targets = ", targets.shape)
Features = (10299, 561)
Targets = (10299,)
# inspect the problem's classes
np.unique(targets)
array([1, 2, 3, 4, 5, 6])
print(np.unique(targets).shape[0])
6
Random split: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(features, targets, test_size=0.2, random_state=0)
print("Exemplos de treinamento:", len(X_train))
print("Exemplos de teste:", len(X_test))
Exemplos de treinamento: 8239
Exemplos de teste: 2060
print(f'Numero de features {X_train.shape[1]}')
Numero de features 561
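Since the activity classes are mildly imbalanced, `train_test_split` can also be given a `stratify` argument to keep the class proportions identical in both sets. A minimal sketch with synthetic labels (the real split above does not use it):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic imbalanced labels: 70 of class 0, 30 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 70 + [1] * 30)

# stratify=y preserves the 70/30 proportion in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

print((y_te == 1).mean())  # 0.3 — same class-1 proportion as in y
```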
2) Designing/building the neural network#
from tensorflow import keras

def deep_net1(input_dim, n_neurons=10, output_dim=6, output_activation='relu', dropout_rate=0.0):
    # input layer
    input_data = keras.layers.Input(shape=(input_dim,))
    # hidden layers (the first Dense is linear, i.e. has no activation)
    x = keras.layers.Dense(n_neurons, activation='linear')(input_data)
    x = keras.layers.Dense(n_neurons, activation='relu')(x)
    x = keras.layers.Dense(n_neurons, activation='relu')(x)
    x = keras.layers.Dropout(dropout_rate)(x)
    # output layer
    output = keras.layers.Dense(output_dim, activation=output_activation)(x)
    # model
    dnn = keras.models.Model(input_data, output)
    return dnn
## instantiate and display
dnn_1 = deep_net1(input_dim=100)
dnn_1.summary()
Model: "model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) [(None, 100)] 0
dense_4 (Dense) (None, 10) 1010
dense_5 (Dense) (None, 10) 110
dense_6 (Dense) (None, 10) 110
dense_7 (Dense) (None, 6) 66
=================================================================
Total params: 1296 (5.06 KB)
Trainable params: 1296 (5.06 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Output layers for common neural-network tasks:
Binary classification: 1 neuron with sigmoid activation, output in [0, 1]
Multiclass classification (>2 classes): 1 neuron per class, with softmax activation
Regression: 1 neuron per response variable, with relu or sigmoid activation
A good option: normalize the response variable to [0, 1], train with relu, and convert the predictions back to the original range
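The min-max normalization trick mentioned above can be sketched with plain NumPy: scale the targets before training, then invert the transform on the predictions.

```python
import numpy as np

y = np.array([1, 2, 3, 4, 5, 6], dtype=float)   # response variable
lo, hi = y.min(), y.max()

y_scaled = (y - lo) / (hi - lo)       # now in [0, 1], suitable for a relu/sigmoid output
# ... train the network on y_scaled ...
y_back = y_scaled * (hi - lo) + lo    # map predictions back to the original range
```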
3) Training: configure/compile and fit the model#
Compilation choices:
Optimizer
Initial learning rate
Learning-rate decay
Loss/cost function
Evaluation metrics
We will first use a regression network to predict values between 1 and 6
dn1 = deep_net1(input_dim=X_train.shape[1], n_neurons=256,
output_dim=1, output_activation='relu')
dn1.summary()
Model: "model_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_3 (InputLayer) [(None, 561)] 0
dense_8 (Dense) (None, 256) 143872
dense_9 (Dense) (None, 256) 65792
dense_10 (Dense) (None, 256) 65792
dense_11 (Dense) (None, 1) 257
=================================================================
Total params: 275713 (1.05 MB)
Trainable params: 275713 (1.05 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
dn1.compile(optimizer=keras.optimizers.Adam(), loss='mse',
metrics=['mae', 'accuracy'])
Training (fit) settings:
Minibatch size
Number of epochs
An optional validation set
batch_size = 16
epochs = 8
history1 = dn1.fit(X_train, y_train,
batch_size=batch_size,
epochs=epochs,
verbose=2)
Epoch 1/8
515/515 - 5s - loss: 0.3101 - mae: 0.3969 - accuracy: 0.1666 - 5s/epoch - 9ms/step
Epoch 2/8
515/515 - 3s - loss: 0.1060 - mae: 0.2484 - accuracy: 0.1682 - 3s/epoch - 5ms/step
Epoch 3/8
515/515 - 3s - loss: 0.0848 - mae: 0.2176 - accuracy: 0.1683 - 3s/epoch - 5ms/step
Epoch 4/8
515/515 - 3s - loss: 0.0739 - mae: 0.1987 - accuracy: 0.1683 - 3s/epoch - 5ms/step
Epoch 5/8
515/515 - 3s - loss: 0.0678 - mae: 0.1896 - accuracy: 0.1683 - 3s/epoch - 5ms/step
Epoch 6/8
515/515 - 3s - loss: 0.0621 - mae: 0.1797 - accuracy: 0.1683 - 3s/epoch - 5ms/step
Epoch 7/8
515/515 - 3s - loss: 0.0674 - mae: 0.1874 - accuracy: 0.1683 - 3s/epoch - 5ms/step
Epoch 8/8
515/515 - 3s - loss: 0.0520 - mae: 0.1583 - accuracy: 0.1683 - 3s/epoch - 5ms/step
16% accuracy?#
Note that 'accuracy' is not a meaningful metric for a continuous regression output; the ≈16% reported during training is essentially the chance level for 6 classes (1/6 ≈ 16.7%).
plt.figure(figsize=(4,3))
plt.plot(history1.history["loss"])
plt.title("model loss")
plt.ylabel("loss")
plt.xlabel("epoch")
plt.show()
plt.figure(figsize=(4,3))
plt.plot(history1.history["accuracy"])
plt.title("model accuracy")
plt.ylabel("accuracy")
plt.xlabel("epoch")
plt.show()
# compute the test-set metrics
score = dn1.evaluate(X_test, y_test, verbose=0)
# evaluate returns [loss, *metrics] in the order given to compile: ['mae', 'accuracy']
print("MSE (loss): ", score[0])
print("MAE: ", score[1])
print("Accuracy: ", score[2])
MSE (loss):  0.05597051605582237
MAE:  0.16757960617542267
Accuracy:  0.1626213639974594
1/6
0.16666666666666666
y_pred = dn1.predict(X_test[:5], verbose=0)
print('actual - predicted')
for i in range(5):
    print(y_test[i], end=' - ')
    print(y_pred[i], end=' - ')
    print(f'{round(y_pred[i][0], 0)}')
actual - predicted
4 - [4.199423] - 4.0
3 - [3.0379071] - 3.0
6 - [5.7010794] - 6.0
2 - [1.9284786] - 2.0
1 - [1.1790736] - 1.0
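Rounding (and clipping to the valid label range) turns these continuous outputs into class labels, whose exact-match rate is a meaningful accuracy. A sketch using the five predictions shown above:

```python
import numpy as np

y_true = np.array([4, 3, 6, 2, 1])
y_cont = np.array([4.199423, 3.0379071, 5.7010794, 1.9284786, 1.1790736])

# round to the nearest class and clip to the valid label range [1, 6]
y_lab = np.clip(np.round(y_cont), 1, 6).astype(int)
acc = np.mean(y_lab == y_true)
print(acc)  # 1.0 — all five rounded predictions match the true labels
```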
4) Switching to a classification network#
The last layer must have one neuron per class (one-hot-encoded output)
The output activation must be softmax
The loss function must also change
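The classification loss used below, `categorical_crossentropy`, is just the negative log-probability that the softmax assigns to the true class (Keras also offers `sparse_categorical_crossentropy`, which takes integer labels directly and skips the one-hot step). The computation for a single sample:

```python
import numpy as np

# softmax output for one sample: probabilities over the 6 classes
p = np.array([0.1, 0.05, 0.7, 0.05, 0.05, 0.05])
true_class = 2                     # 0-indexed true class

# with a one-hot target, categorical_crossentropy reduces to -log(p[true])
loss = -np.log(p[true_class])
print(round(loss, 4))  # 0.3567
```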
# a look at the target vector y
y_train[:3]
array([3, 4, 1])
np.unique(y_train)
array([1, 2, 3, 4, 5, 6])
num_classes = np.unique(y_train).shape[0]
print(num_classes)
6
We need to create one-hot target vectors
We can use to_categorical for this, but it requires labels starting at 0 (zero)
y_train_e = keras.utils.to_categorical(y_train-1, num_classes)
y_test_e = keras.utils.to_categorical(y_test-1, num_classes)
# resulting one-hot vectors
y_train_e[:3]
array([[0., 0., 1., 0., 0., 0.],
[0., 0., 0., 1., 0., 0.],
[1., 0., 0., 0., 0., 0.]], dtype=float32)
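`to_categorical` is equivalent to indexing an identity matrix with the zero-based labels, which makes the `y_train-1` shift explicit. A NumPy sketch:

```python
import numpy as np

y = np.array([3, 4, 1])                 # labels start at 1
num_classes = 6
one_hot = np.eye(num_classes)[y - 1]    # shift to 0-based, then index the identity matrix
print(one_hot)
```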
Or use sklearn's OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
y_train.shape
(8239,)
# row = instance/example
# column = feature
# the array must therefore be 8239 x 1
ohe = OneHotEncoder()
ohe.fit(y_train.reshape(-1, 1))
y_train_es = ohe.transform(y_train.reshape(-1,1))
y_test_es = ohe.transform(y_test.reshape(-1,1))
y_train_es
<8239x6 sparse matrix of type '<class 'numpy.float64'>'
with 8239 stored elements in Compressed Sparse Row format>
y_train_es[:3].A
array([[0., 0., 1., 0., 0., 0.],
[0., 0., 0., 1., 0., 0.],
[1., 0., 0., 0., 0., 0.]])
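One advantage of `OneHotEncoder` is that it remembers the label-to-column mapping, so encoded rows can be mapped back with `inverse_transform`. A self-contained sketch (the `categories` argument, a real `OneHotEncoder` parameter, pins the 6 known classes even if the sample misses some):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

y = np.array([3, 4, 1, 6]).reshape(-1, 1)            # one label per row
ohe = OneHotEncoder(categories=[list(range(1, 7))])  # fix the 6 known classes
y_enc = ohe.fit_transform(y)                         # sparse (4, 6) one-hot matrix
y_dec = ohe.inverse_transform(y_enc)                 # back to the original labels
print(y_dec.ravel())  # [3 4 1 6]
```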
dclass = deep_net1(input_dim=X_train.shape[1], n_neurons=256,
output_dim=6, output_activation='softmax')
dclass.summary()
Model: "model_3"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_4 (InputLayer) [(None, 561)] 0
dense_12 (Dense) (None, 256) 143872
dense_13 (Dense) (None, 256) 65792
dense_14 (Dense) (None, 256) 65792
dense_15 (Dense) (None, 6) 1542
=================================================================
Total params: 276998 (1.06 MB)
Trainable params: 276998 (1.06 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
dclass.compile(
optimizer=keras.optimizers.Adam(),
loss="categorical_crossentropy",
metrics=['accuracy', 'mse']
)
epochs = 10
history2 = dclass.fit(X_train, y_train_e,
batch_size=batch_size,
epochs=epochs,
verbose=1,
)
Epoch 1/10
515/515 [==============================] - 5s 6ms/step - loss: 0.3499 - accuracy: 0.8545 - mse: 0.0340
Epoch 2/10
515/515 [==============================] - 3s 6ms/step - loss: 0.1745 - accuracy: 0.9281 - mse: 0.0171
Epoch 3/10
515/515 [==============================] - 3s 6ms/step - loss: 0.1480 - accuracy: 0.9420 - mse: 0.0144
Epoch 4/10
515/515 [==============================] - 3s 6ms/step - loss: 0.1336 - accuracy: 0.9491 - mse: 0.0124
Epoch 5/10
515/515 [==============================] - 3s 6ms/step - loss: 0.1176 - accuracy: 0.9531 - mse: 0.0112
Epoch 6/10
515/515 [==============================] - 3s 6ms/step - loss: 0.0987 - accuracy: 0.9623 - mse: 0.0093
Epoch 7/10
515/515 [==============================] - 3s 6ms/step - loss: 0.0936 - accuracy: 0.9614 - mse: 0.0092
Epoch 8/10
515/515 [==============================] - 3s 6ms/step - loss: 0.1126 - accuracy: 0.9607 - mse: 0.0099
Epoch 9/10
515/515 [==============================] - 3s 6ms/step - loss: 0.0747 - accuracy: 0.9705 - mse: 0.0073
Epoch 10/10
515/515 [==============================] - 3s 6ms/step - loss: 0.0868 - accuracy: 0.9677 - mse: 0.0082
# get the predictions and round to 2 decimal places
y_pred = np.round(dclass.predict(X_test[:8]), 2)
print('actual - predicted')
for i in range(8):
    print(y_test[i], end=' - ')
    print(y_pred[i])
1/1 [==============================] - 0s 125ms/step
actual - predicted
4 - [0. 0. 0. 0.87 0.13 0. ]
3 - [0. 0. 1. 0. 0. 0.]
6 - [0. 0. 0. 0. 0. 1.]
2 - [0. 1. 0. 0. 0. 0.]
1 - [1. 0. 0. 0. 0. 0.]
4 - [0. 0. 0. 1. 0. 0.]
3 - [0. 0. 1. 0. 0. 0.]
2 - [0. 1. 0. 0. 0. 0.]
y_test[1], np.argmax(y_pred[1])+1
(3, 3)
y_test[0], np.argmax(y_pred[0])+1
(4, 4)
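The per-example `argmax` decoding above can be vectorized over all prediction rows at once (adding 1 to return to the original 1-based labels):

```python
import numpy as np

# probability rows like the ones printed above (one row per test example)
y_prob = np.array([[0.,   0., 0., 0.87, 0.13, 0.],
                   [0.,   0., 1., 0.,   0.,   0.],
                   [0.,   0., 0., 0.,   0.,   1.]])

y_lab = np.argmax(y_prob, axis=1) + 1   # column index of the max, shifted back to 1..6
print(y_lab)  # [4 3 6]
```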
5) Small improvements#
Add dropout
Learning-rate decay
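Dropout randomly zeroes a fraction `rate` of the activations during training and rescales the survivors by 1/(1-rate), so the expected activation is unchanged; the Keras `Dropout` layer applies this "inverted dropout" only in training mode. A NumPy sketch of one forward pass:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.ones((4, 8))                    # activations entering the Dropout layer
rate = 0.2

keep = rng.random(x.shape) >= rate     # each unit survives with probability 1 - rate
x_drop = np.where(keep, x / (1 - rate), 0.0)   # survivors scaled by 1/0.8 = 1.25
```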
dclass2 = deep_net1(X_train.shape[1], output_dim=6, n_neurons=256,
output_activation='softmax', dropout_rate=0.2)
def scheduler(epoch, lr):
    # epoch - current epoch (0-indexed)
    # lr - current learning rate
    # keep the initial LR for the first 5 epochs
    if epoch > 4:
        # then decay it exponentially, with factor -0.1 per epoch
        lr = lr * tf.math.exp(-0.1)
    return lr
callbacklr = tf.keras.callbacks.LearningRateScheduler(scheduler)
dclass2.compile(
optimizer=keras.optimizers.Adam(0.0015),
loss="categorical_crossentropy",
metrics=['accuracy']
)
epochs = 10
history3 = dclass2.fit(X_train, y_train_e,
batch_size=batch_size,
epochs=epochs,
verbose=1,
callbacks=[callbacklr])
Epoch 1/10
515/515 [==============================] - 5s 6ms/step - loss: 0.4051 - accuracy: 0.8337 - lr: 0.0015
Epoch 2/10
515/515 [==============================] - 3s 6ms/step - loss: 0.1857 - accuracy: 0.9264 - lr: 0.0015
Epoch 3/10
515/515 [==============================] - 3s 6ms/step - loss: 0.1350 - accuracy: 0.9473 - lr: 0.0015
Epoch 4/10
515/515 [==============================] - 3s 6ms/step - loss: 0.1230 - accuracy: 0.9541 - lr: 0.0015
Epoch 5/10
515/515 [==============================] - 3s 6ms/step - loss: 0.1374 - accuracy: 0.9494 - lr: 0.0015
Epoch 6/10
515/515 [==============================] - 3s 6ms/step - loss: 0.1016 - accuracy: 0.9610 - lr: 0.0014
Epoch 7/10
515/515 [==============================] - 3s 6ms/step - loss: 0.0957 - accuracy: 0.9626 - lr: 0.0012
Epoch 8/10
515/515 [==============================] - 3s 6ms/step - loss: 0.1022 - accuracy: 0.9636 - lr: 0.0011
Epoch 9/10
515/515 [==============================] - 3s 6ms/step - loss: 0.0921 - accuracy: 0.9660 - lr: 0.0010
Epoch 10/10
515/515 [==============================] - 3s 6ms/step - loss: 0.0704 - accuracy: 0.9728 - lr: 9.0980e-04
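The logged learning rates follow directly from the scheduler: 0.0015 is kept for the first five epochs (0-indexed epochs 0-4), then multiplied by exp(-0.1) once per epoch. A plain-Python reproduction (using `math.exp` instead of `tf.math.exp`, so float32 rounding may differ in the last digits):

```python
import math

lr = 0.0015
lrs = []
for epoch in range(10):
    if epoch > 4:                  # decay only after the fifth epoch
        lr = lr * math.exp(-0.1)
    lrs.append(lr)

print(f"{lrs[9]:.4e}")  # 9.0980e-04, matching the Epoch 10 log line
```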
# get the predictions and round to 1 decimal place
y_pred = np.round(dclass2.predict(X_test[:8]), 1)
print('actual - predicted')
for i in range(8):
    print(y_test[i], end=' - ')
    print(y_pred[i])
1/1 [==============================] - 0s 120ms/step
actual - predicted
4 - [0. 0. 0. 0.8 0.2 0. ]
3 - [0. 0. 1. 0. 0. 0.]
6 - [0. 0. 0. 0. 0. 1.]
2 - [0. 1. 0. 0. 0. 0.]
1 - [1. 0. 0. 0. 0. 0.]
4 - [0. 0. 0. 1. 0. 0.]
3 - [0. 0. 1. 0. 0. 0.]
2 - [0. 1. 0. 0. 0. 0.]
Extra: using the functional API to build parallel branches in the network#
def deep_net2(input_dim1, input_dim2, n_neurons=10, output_dim=6, output_activation='relu', dropout_rate=0.0):
    # input layers (one per branch)
    input_data1 = keras.layers.Input(shape=(input_dim1,))
    input_data2 = keras.layers.Input(shape=(input_dim2,))
    # hidden layers: two parallel branches, then a concatenation
    a = keras.layers.Dense(n_neurons, activation='relu')(input_data1)
    b = keras.layers.Dense(n_neurons, activation='relu')(input_data2)
    c = keras.layers.Concatenate()([a, b])
    y = keras.layers.Dense(n_neurons, activation='relu')(c)
    x = keras.layers.Dropout(dropout_rate)(y)
    # output layer
    output = keras.layers.Dense(output_dim, activation=output_activation)(x)
    # model with two inputs
    dnn = keras.models.Model([input_data1, input_data2], output)
    return dnn
dnn2 = deep_net2(10, 100)
dnn2.summary()
Model: "model_7"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_8 (InputLayer) [(None, 10)] 0 []
input_9 (InputLayer) [(None, 100)] 0 []
dense_28 (Dense) (None, 10) 110 ['input_8[0][0]']
dense_29 (Dense) (None, 10) 1010 ['input_9[0][0]']
concatenate_1 (Concatenate (None, 20) 0 ['dense_28[0][0]',
) 'dense_29[0][0]']
dense_30 (Dense) (None, 10) 210 ['concatenate_1[0][0]']
dropout_3 (Dropout) (None, 10) 0 ['dense_30[0][0]']
dense_31 (Dense) (None, 6) 66 ['dropout_3[0][0]']
==================================================================================================
Total params: 1396 (5.45 KB)
Trainable params: 1396 (5.45 KB)
Non-trainable params: 0 (0.00 Byte)
__________________________________________________________________________________________________
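The `Concatenate` layer simply joins the two branches along the feature axis, which also explains the parameter count of the Dense layer that follows it in the summary. A NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 10))   # output of branch 1 for a batch of 4
b = rng.normal(size=(4, 10))   # output of branch 2

# Concatenate joins the branches along the feature axis: (4, 10) + (4, 10) -> (4, 20)
c = np.concatenate([a, b], axis=1)
print(c.shape)  # (4, 20)

# hence dense_30's parameter count: 20 inputs * 10 units + 10 biases
params = 20 * 10 + 10  # = 210
```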